Abstract

Submodular functions are relevant to machine learning for two main
reasons: (1) some problems may be expressed directly as the optimization
of submodular functions, and (2) the Lovasz extension of submodular
functions provides a useful set of regularization functions for
supervised and unsupervised learning. In this paper, we present the theory
of submodular functions from a convex analysis perspective, presenting
tight links between certain polyhedra, combinatorial optimization and
convex optimization problems. In particular, we show how submodular
function minimization is equivalent to solving a wide variety of convex
optimization problems. This allows the derivation of new efficient
algorithms for approximate submodular function minimization with
theoretical guarantees and good practical performance. By listing many
examples of submodular functions, we review various applications to
machine learning, such as clustering or subset selection, as well as a
family of structured sparsity-inducing norms that can be derived from
submodular functions.

Introduction



Many combinatorial optimization problems may be cast as the min- 
imization of a set-function, that is a function defined on the set of 
subsets of a given base set V. Equivalently, they may be defined as 
functions on the vertices of the hypercube, i.e., $\{0, 1\}^p$, where $p$ is
the cardinality of the base set V — they are then often referred to as 
pseudo-boolean functions [15]. Among these set-functions, submodular 
functions play an important role, similar to convex functions on vector 
spaces, as many functions that occur in practical problems turn out to 
be submodular functions or slight modifications thereof, with applications
in many areas of computer science and applied mathematics,
such as machine learning [861 11051 IHOl [85] , computer vision [181 E2] , op- 
erations research (63^ 1118] or electrical networks [llOj . Since submodu- 
lar functions may be minimized exactly, and maximized approximately 
with some guarantees, in polynomial time, they readily lead to efficient 
algorithms for all the numerous problems they apply to. 

However, interest in submodular functions is not limited to discrete
optimization problems. Indeed, the rich structure of submodular
functions and their link with convex analysis through the Lovasz
extension [92] and the various associated polytopes make them particularly
adapted to problems beyond combinatorial optimization, namely as
regularizers in signal processing and machine learning problems [21, 6].
Indeed, many continuous optimization problems exhibit an underlying 
discrete structure, and submodular functions provide an efficient and 
versatile tool to capture such combinatorial structures. 

In this paper, the theory of submodular functions is presented, in 
a self-contained way, with all results proved from first principles of 
convex analysis common in machine learning, rather than relying on 
combinatorial optimization and traditional theoretical computer sci- 
ence concepts such as matroids. A good knowledge of convex analysis 
is assumed (see, e.g., [IZllIBj) and a short review of important concepts 
is presented in Appendix A.

Paper outline. The paper is organized in several sections, which are
summarized below:

(1) Definitions: In Section 1, we give the different definitions
of submodular functions and of the associated polyhedra.

(2) Lovasz extension: In Section 2, we define the Lovasz extension
and give its main properties. In particular, we present
the key result in submodular analysis, namely, the link between
the Lovasz extension and the submodular polyhedra
through the so-called "greedy algorithm". We also present
the link between sparsity-inducing norms and the Lovasz
extensions of non-decreasing submodular functions.

(3) Examples: In Section 3, we present classical examples of
submodular functions, together with the main applications
in machine learning.

(4) Polyhedra: Associated polyhedra are further studied in
Section 4, where support functions and the associated maximizers
are computed. We also detail the facial structure of such
polyhedra, and show how it relates to the sparsity-inducing
properties of the Lovasz extension.

(5) Separable optimization - Analysis: In Section 5, we
consider separable optimization problems regularized by the
Lovasz extension, and show how this is equivalent to a
sequence of submodular function minimization problems. This
is the key theoretical link between combinatorial and convex
optimization problems related to submodular functions.

(6) Separable optimization - Algorithms: In Section 6, we
present two sets of algorithms for separable optimization
problems. The first is an exact algorithm which
relies on the availability of a submodular function
minimization algorithm, while the second set of algorithms is
based on existing iterative algorithms for convex optimization,
some of which come with online and offline theoretical
guarantees.

(7) Submodular function minimization: In Section 7, we
present various approaches to submodular function
minimization. We briefly present the combinatorial algorithms
for exact submodular function minimization, and focus in
more depth on the use of specific convex separable optimization
problems, which can be solved iteratively to obtain
approximate solutions for submodular function minimization,
with theoretical guarantees and approximate optimality
certificates.

(8) Submodular optimization problems: In Section 8, we
present other combinatorial optimization problems which can
be partially solved using submodular analysis, such as
submodular function maximization and the optimization of
differences of submodular functions, and relate these to
non-convex optimization problems on the submodular polyhedra.

(9) Experiments: In Section 9, we provide illustrations of
the optimization algorithms described earlier, for
submodular function minimization, as well as for convex
optimization problems (separable or not). The Matlab
code for all these experiments may be found at
http://www.di.ens.fr/~fbach/submodular/

In Appendix A, we review relevant notions from convex analysis
and convex optimization, while in Appendix B, we present several
results related to submodular functions, such as operations that preserve
submodularity.

Several books and articles already exist on the same topic
and the material presented in this paper rely mostly on those |49t llim 
11331 187]. However, in order to present the material in the simplest way, 
ideas from related research papers have also been used. 

Notation. We consider the set $V = \{1, \dots, p\}$, and its power set $2^V$,
composed of the $2^p$ subsets of $V$. Given a vector $s \in \mathbb{R}^p$, $s$ also denotes
the modular set-function defined as $s(A) = \sum_{k \in A} s_k$. Moreover, $A \subseteq B$
means that $A$ is a subset of $B$, potentially equal to $B$. For $q \in [1, +\infty]$,
we denote by $\|w\|_q$ the $\ell_q$-norm of $w$, by $|A|$ the cardinality of the set
$A$, and, for $A \subseteq V = \{1, \dots, p\}$, $1_A$ denotes the indicator vector of the
set $A$. If $w \in \mathbb{R}^p$ and $\alpha \in \mathbb{R}$, then $\{w \geq \alpha\}$ (resp. $\{w > \alpha\}$) denotes
the subset of $V = \{1, \dots, p\}$ defined as $\{k \in V, w_k \geq \alpha\}$ (resp. $\{k \in V, w_k > \alpha\}$),
which we refer to as the weak (resp. strong) $\alpha$-sup-level
sets of $w$. Similarly, if $v \in \mathbb{R}^p$, we denote $\{w \geq v\} = \{k \in V, w_k \geq v_k\}$.

Definitions



Throughout this paper, we consider $V = \{1, \dots, p\}$, $p > 0$, and its
power set (i.e., set of all subsets) $2^V$, which is of cardinality $2^p$. We
also consider a real-valued set-function $F : 2^V \to \mathbb{R}$ such that $F(\emptyset) = 0$.
As opposed to the common convention with convex functions (see
Appendix A), we do not allow infinite values for the function $F$.

1.1 Equivalent definitions of submodularity 

Submodular functions may be defined through several equivalent prop- 
erties, which we now present. 



Definition 1.1 (Submodular function). A set-function $F : 2^V \to \mathbb{R}$
is submodular if and only if, for all subsets $A, B \subseteq V$, we have:
$F(A) + F(B) \geq F(A \cup B) + F(A \cap B)$.



The simplest example of a submodular function is the cardinality (i.e.,
$F(A) = |A|$, where $|A|$ is the number of elements of $A$), which is both
submodular and supermodular (i.e., its opposite is submodular); we refer
to such functions as modular.
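
For small base sets, Def. 1.1 can be checked by direct enumeration. The following is a minimal sketch, not part of the original text: it assumes set-functions are given as Python callables on frozensets, and the names `subsets`, `is_submodular` and the three example functions are illustrative only. The cost is exponential in p, so this is only meant for sanity checks.

```python
from itertools import combinations


def subsets(V):
    """Yield every subset of V as a frozenset."""
    items = list(V)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)


def is_submodular(F, V, tol=1e-12):
    """Brute-force check of F(A) + F(B) >= F(A | B) + F(A & B) for all A, B."""
    all_sets = list(subsets(V))
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - tol
               for A in all_sets for B in all_sets)


V = {1, 2, 3}


def cardinality(A):
    return len(A)            # modular: submodular and supermodular


def sqrt_cardinality(A):
    return len(A) ** 0.5     # concave function of |A|: submodular


def squared_cardinality(A):
    return len(A) ** 2       # convex function of |A|: not submodular here
```

For instance, `is_submodular(cardinality, V)` and `is_submodular(lambda A: -len(A), V)` both hold, which is the modularity of the cardinality stated above.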






From Def. 1.1, it is clear that the set of submodular functions is
closed under addition and multiplication by a positive scalar.
Checking the condition in Def. 1.1 is not always easy in practice; it turns
out that it can be restricted to only certain sets $A$ and $B$, which we
now present.

The following proposition shows that a submodular function has the
"diminishing returns" property, and that this is sufficient to be submodular.
Thus, submodular functions may be seen as a discrete analog to concave
functions. However, as shown in Section 2, in terms of optimization they
behave more like convex functions (e.g., efficient minimization, duality
theory, links with the convex Lovasz extension).



Proposition 1.1. (Definition with first order differences) The
set-function $F$ is submodular if and only if for all $A, B \subseteq V$ and $k \in V$,
such that $A \subseteq B$ and $k \notin B$, we have
$F(A \cup \{k\}) - F(A) \geq F(B \cup \{k\}) - F(B)$.



Proof. Let $A \subseteq B$ and $k \notin B$. Then
$F(A \cup \{k\}) - F(A) - F(B \cup \{k\}) + F(B) = F(C) + F(D) - F(C \cup D) - F(C \cap D)$
with $C = A \cup \{k\}$ and $D = B$,
which shows that the condition is necessary. To prove the opposite, we
assume that the condition is satisfied; one can first show that if $A \subseteq B$
and $C \cap B = \emptyset$, then $F(A \cup C) - F(A) \geq F(B \cup C) - F(B)$ (this
can be obtained by summing the $m$ inequalities
$F(A \cup \{c_1, \dots, c_k\}) - F(A \cup \{c_1, \dots, c_{k-1}\}) \geq F(B \cup \{c_1, \dots, c_k\}) - F(B \cup \{c_1, \dots, c_{k-1}\})$,
where $C = \{c_1, \dots, c_m\}$).

Then, for any $X, Y \subseteq V$, take $A = X \cap Y$, $C = X \backslash Y$ and $B = Y$
(which implies $A \cup C = X$ and $B \cup C = X \cup Y$) to obtain
$F(X) + F(Y) \geq F(X \cup Y) + F(X \cap Y)$, which shows that the condition is sufficient. □

The following proposition gives the tightest condition for submod- 
ularity (easiest to show in practice). 



Proposition 1.2. (Definition with second order differences)
The set-function $F$ is submodular if and only if for all $A \subseteq V$ and
distinct $j, k \in V \backslash A$, we have
$F(A \cup \{k\}) - F(A) \geq F(A \cup \{j, k\}) - F(A \cup \{j\})$.

Proof. This condition is weaker than the one from the previous proposition
(as it corresponds to taking $B = A \cup \{j\}$). To prove that it is still
sufficient, simply apply it to the subsets $A \cup \{b_1, \dots, b_{s-1}\}$ with $j = b_s$, for
$B = A \cup \{b_1, \dots, b_m\} \supseteq A$ with $k \notin B$, and sum the $m$ inequalities
$F(A \cup \{b_1, \dots, b_{s-1}\} \cup \{k\}) - F(A \cup \{b_1, \dots, b_{s-1}\}) \geq F(A \cup \{b_1, \dots, b_s\} \cup \{k\}) - F(A \cup \{b_1, \dots, b_s\})$,
to obtain the condition in Prop. 1.1. □

In order to show that a given set-function is submodular, there are
several possibilities: (a) using Prop. 1.2 directly, (b) using the Lovasz
extension (see Section 2) and showing that it is convex, (c) casting it as a
special case from Section 3 (typically a cut or a flow), or (d) using known
operations on submodular functions presented in Appendix B.2.
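
Option (a) can be sketched numerically as follows. This is an illustrative check under stated assumptions, not an algorithm from the paper: set-functions are Python callables on frozensets, and `is_submodular_second_order`, `EDGES` and `cut` are hypothetical names. It enumerates all sets A and all unordered pairs {j, k} outside A (the condition of Prop. 1.2 is symmetric in j and k), which is cheaper than testing all pairs (A, B) but still exponential in p.

```python
from itertools import combinations


def is_submodular_second_order(F, V, tol=1e-12):
    """Check F(A+k) - F(A) >= F(A+j+k) - F(A+j) for all A and distinct j, k outside A."""
    V = frozenset(V)
    for r in range(len(V) + 1):
        for c in combinations(sorted(V), r):
            A = frozenset(c)
            for j, k in combinations(sorted(V - A), 2):
                gain_without_j = F(A | {k}) - F(A)
                gain_with_j = F(A | {j, k}) - F(A | {j})
                if gain_without_j < gain_with_j - tol:
                    return False
    return True


# Hypothetical example: cut function of the undirected chain 1 - 2 - 3 - 4.
EDGES = [(1, 2), (2, 3), (3, 4)]


def cut(A):
    """Number of edges with exactly one endpoint in A."""
    return sum(1 for u, v in EDGES if (u in A) != (v in A))
```

Here `is_submodular_second_order(cut, {1, 2, 3, 4})` holds, while a convex function of the cardinality such as `lambda A: len(A) ** 2` fails the test.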

1.2 Associated polyhedra 

A vector $s \in \mathbb{R}^p$ naturally leads to a modular set-function defined as
$s(A) = \sum_{k \in A} s_k = s^\top 1_A$, where $1_A \in \mathbb{R}^p$ is the indicator vector of the
set $A$. We now define specific polyhedra in $\mathbb{R}^p$. These play a crucial role
in submodular analysis, as most results may be interpreted or proved
using such polyhedra.



Definition 1.2 (Submodular and base polyhedra). Let $F$ be a
submodular function such that $F(\emptyset) = 0$. The submodular polyhedron
$P(F)$ and the base polyhedron $B(F)$ are defined as:

$$ P(F) = \{ s \in \mathbb{R}^p, \ \forall A \subseteq V, \ s(A) \leq F(A) \}, $$
$$ B(F) = \{ s \in \mathbb{R}^p, \ s(V) = F(V), \ \forall A \subseteq V, \ s(A) \leq F(A) \} = P(F) \cap \{ s(V) = F(V) \}. $$

As shown in the following proposition, the submodular polyhedron
$P(F)$ has non-empty interior and is unbounded. Note that the other
polyhedron (the base polyhedron) will be shown to be non-empty and
bounded as a consequence of Prop. 2.2. It has empty interior since it
is included in the hyperplane $s(V) = F(V)$. See Figure 1.1 for examples
with $p = 2$ and $p = 3$.

Fig. 1.1: Submodular polyhedron $P(F)$ and base polyhedron $B(F)$ for
$p = 2$ (left) and $p = 3$ (right), for a non-decreasing submodular function
(for which $B(F) \subseteq \mathbb{R}^p_+$, see Prop. 1.4).



Proposition 1.3. (Properties of submodular polyhedron) Let
$F$ be a submodular function such that $F(\emptyset) = 0$. If $s \in P(F)$, then
for all $t \in \mathbb{R}^p$ such that $t \leq s$, we have $t \in P(F)$. Moreover, $P(F)$ has
non-empty interior.



Proof. The first part is trivial, since $t \leq s$ implies that for all $A \subseteq V$,
$t(A) \leq s(A)$. For the second part, we only need to show that
$P(F)$ is non-empty, which is true since the constant vector equal to
$\min_{A \subseteq V, A \neq \emptyset} F(A)/|A|$ belongs to $P(F)$. □
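
The two statements of Prop. 1.3 can be illustrated on a small example. The sketch below is only an assumption-laden illustration: indices run over 0, ..., p-1, set-functions are callables on frozensets, and `nonempty_subsets`, `in_submodular_polyhedron`, `constant_point` and `sqrt_card` are names introduced here. Membership is tested by enumerating all 2^p - 1 constraints, so it is only usable for small p.

```python
from itertools import combinations


def nonempty_subsets(p):
    """Yield all non-empty subsets of {0, ..., p-1} as frozensets."""
    for r in range(1, p + 1):
        for A in combinations(range(p), r):
            yield frozenset(A)


def in_submodular_polyhedron(s, F, tol=1e-9):
    """Check s(A) <= F(A) for every non-empty A (the constraint for the empty set is 0 <= 0)."""
    return all(sum(s[k] for k in A) <= F(A) + tol
               for A in nonempty_subsets(len(s)))


def constant_point(F, p):
    """Constant vector with entries min over non-empty A of F(A) / |A|, as in the proof."""
    c = min(F(A) / len(A) for A in nonempty_subsets(p))
    return [c] * p


def sqrt_card(A):
    return len(A) ** 0.5  # submodular, with F(empty set) = 0


s = constant_point(sqrt_card, 3)
t = [x - 1.0 for x in s]  # componentwise smaller than s
```

Both `s` and `t` pass `in_submodular_polyhedron`, while a vector such as `[1.0, 1.0, 1.0]` violates the constraint for A = V.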

1.3 Polymatroids (non-decreasing submodular functions)

When the submodular function $F$ is also non-decreasing, i.e., when for
$A, B \subseteq V$, $A \subseteq B \Rightarrow F(A) \leq F(B)$, then the function is often referred
to as a polymatroid rank function (see related matroid rank functions
in Section 3). For these functions, the base polyhedron is included
in the positive orthant, and this is in fact a characteristic property.

Proposition 1.4. (Base polyhedron and polymatroids) Let $F$
be a submodular function such that $F(\emptyset) = 0$. The function $F$ is
non-decreasing if and only if the base polyhedron is included in the positive
orthant $\mathbb{R}^p_+$.



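Proof. The simplest proof uses the greedy algorithm from Section 2.2.
We have from Prop. 2.2,
$\min_{s \in B(F)} s_k = - \max_{s \in B(F)} (-1_{\{k\}})^\top s = - f(-1_{\{k\}}) = F(V) - F(V \backslash \{k\})$.
Thus, $B(F) \subseteq \mathbb{R}^p_+$ if and only if
for all $k \in V$, $F(V) - F(V \backslash \{k\}) \geq 0$. Since, by submodularity, for all
$A \subseteq V$ and $k \notin A$, $F(A \cup \{k\}) - F(A) \geq F(V) - F(V \backslash \{k\})$, $B(F) \subseteq \mathbb{R}^p_+$
if and only if $F$ is non-decreasing. □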

For polymatroids, another polyhedron is often considered, the
symmetric independence polyhedron, which we now define. This polyhedron
will turn out to be the unit ball of the dual norm of the norm
defined in Section 2.3 (see more details and figures in Section 2.3).



Definition 1.3 (Symmetric independence polyhedron). Let $F$
be a non-decreasing submodular function such that $F(\emptyset) = 0$. The
symmetric independence polyhedron $|P|(F)$ is defined as:

$$ |P|(F) = \{ s \in \mathbb{R}^p, \ \forall A \subseteq V, \ |s|(A) \leq F(A) \} = \{ s \in \mathbb{R}^p, \ |s| \in P(F) \}. $$

Lovasz extension

We first consider a set-function $F$ such that $F(\emptyset) = 0$, which is not
necessarily submodular. We can define its Lovasz extension [92], which
is often referred to as its Choquet integral [26]. The Lovasz extension
allows us to draw links between submodular set-functions and regular
convex functions, and to transfer known results from convex analysis, such
as duality. In particular, we prove in this section the two key results
of submodular analysis, namely that (a) a set-function is submodular
if and only if its Lovasz extension is convex, and (b) the Lovasz
extension is the support function of the base polyhedron, with a
direct relationship through the "greedy algorithm". We then present in
Section 2.3 how, for non-decreasing submodular functions, the Lovasz
extension may be used to define a structured sparsity-inducing norm.

2.1 Definition 

We now define the Lovasz extension of any set-function (not necessarily
submodular).

Definition 2.1 (Lovasz extension). Given a set-function $F$ such
that $F(\emptyset) = 0$, the Lovasz extension $f : \mathbb{R}^p \to \mathbb{R}$ is defined as follows:
for $w \in \mathbb{R}^p$, order the components in decreasing order $w_{j_1} \geq \cdots \geq w_{j_p}$,
and define $f(w)$ through any of the following equivalent equations:

$$ f(w) = \sum_{k=1}^{p} w_{j_k} \big[ F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\}) \big], \tag{2.1} $$
$$ f(w) = \sum_{k=1}^{p-1} F(\{j_1, \dots, j_k\}) (w_{j_k} - w_{j_{k+1}}) + F(V) w_{j_p}, \tag{2.2} $$
$$ f(w) = \int_{\min\{w_1, \dots, w_p\}}^{+\infty} F(\{w \geq z\}) \, dz + F(V) \min\{w_1, \dots, w_p\}, \tag{2.3} $$
$$ f(w) = \int_{0}^{+\infty} F(\{w \geq z\}) \, dz + \int_{-\infty}^{0} \big[ F(\{w \geq z\}) - F(V) \big] \, dz. \tag{2.4} $$



Proof. To prove that we actually define a function, one needs to prove
that the definitions are independent of the potentially non-unique
ordering $w_{j_1} \geq \cdots \geq w_{j_p}$, which is trivial from the last
formulation in Eq. (2.4). The first and second formulations in Eq. (2.1) and
Eq. (2.2) are equivalent (by integration by parts, or Abel summation
formula). To show equivalence with Eq. (2.3), one may notice that
$z \mapsto F(\{w \geq z\})$ is piecewise constant, with value zero for
$z > w_{j_1} = \max\{w_1, \dots, w_p\}$, equal to $F(\{j_1, \dots, j_k\})$ for
$z \in (w_{j_{k+1}}, w_{j_k})$, $k \in \{1, \dots, p-1\}$, and equal to $F(V)$ for
$z < w_{j_p} = \min\{w_1, \dots, w_p\}$.
What happens at break points is irrelevant for integration.

To prove Eq. (2.4) from Eq. (2.3), notice that for
$\alpha \leq \min\{0, w_1, \dots, w_p\}$, Eq. (2.3) leads to

$$ f(w) = \int_{\alpha}^{+\infty} F(\{w \geq z\}) \, dz - \int_{\alpha}^{\min\{w_1, \dots, w_p\}} F(\{w \geq z\}) \, dz + F(V) \min\{w_1, \dots, w_p\} $$
$$ \quad = \int_{\alpha}^{+\infty} F(\{w \geq z\}) \, dz - \int_{\alpha}^{\min\{w_1, \dots, w_p\}} F(V) \, dz + \int_{0}^{\min\{w_1, \dots, w_p\}} F(V) \, dz $$
$$ \quad = \int_{\alpha}^{+\infty} F(\{w \geq z\}) \, dz - \int_{\alpha}^{0} F(V) \, dz, $$

and we get the result by letting $\alpha$ tend to $-\infty$. □

Note that for modular functions $A \mapsto s(A)$, with $s \in \mathbb{R}^p$, the
Lovasz extension is the linear function $w \mapsto w^\top s$. Moreover, for $p = 2$,
we have

$$ f(w) = \tfrac{1}{2} [F(\{1\}) + F(\{2\}) - F(\{1,2\})] \cdot |w_1 - w_2| + \tfrac{1}{2} [F(\{1\}) - F(\{2\}) + F(\{1,2\})] \cdot w_1 + \tfrac{1}{2} [-F(\{1\}) + F(\{2\}) + F(\{1,2\})] \cdot w_2 $$
$$ \quad = - [F(\{1\}) + F(\{2\}) - F(\{1,2\})] \min\{w_1, w_2\} + F(\{1\}) w_1 + F(\{2\}) w_2, $$

which allows an illustration of various propositions in this section (in
particular Prop. 2.1).
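
Eq. (2.1) translates directly into a sort followed by a single pass over the components. The sketch below is illustrative only and rests on assumptions not in the original text: indices run over 0, ..., p-1 (so the sets {1} and {2} of the p = 2 formula become {0} and {1}), set-functions are Python callables on frozensets with F(frozenset()) = 0, and all function names are hypothetical.

```python
def lovasz_extension(F, w):
    """Evaluate the Lovasz extension of F at w through Eq. (2.1)."""
    order = sorted(range(len(w)), key=lambda k: -w[k])  # decreasing components
    value, previous, prefix = 0.0, 0.0, set()
    for j in order:
        prefix.add(j)
        current = F(frozenset(prefix))
        value += w[j] * (current - previous)  # w_{j_k} times the marginal gain
        previous = current
    return value


def lovasz_extension_p2(F, w):
    """Closed form for p = 2 (second expression in the text above)."""
    F1, F2, F12 = F(frozenset({0})), F(frozenset({1})), F(frozenset({0, 1}))
    return -(F1 + F2 - F12) * min(w) + F1 * w[0] + F2 * w[1]


# Hypothetical examples: cut function of the chain 0 - 1 - 2, and sqrt of cardinality.
EDGES = [(0, 1), (1, 2)]


def chain_cut(A):
    return sum(1 for u, v in EDGES if (u in A) != (v in A))


def sqrt_card(A):
    return len(A) ** 0.5
```

For the chain cut, the value at w coincides with |w_0 - w_1| + |w_1 - w_2|, and evaluating at an indicator vector recovers the set-function, in line with the extension property discussed next.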

The following proposition details classical properties of the Choquet
integral/Lovasz extension. In particular, property (f) below implies
that the Lovasz extension is equal to the original set-function on
$\{0, 1\}^p$ (which can canonically be identified with $2^V$), and hence is indeed
an extension of $F$. See an illustration in Figure 2.1 for $p = 2$.

Proposition 2.1. (Properties of Lovasz extension) Let $F$ be any
set-function such that $F(\emptyset) = 0$. We have:

(a) if $F$ and $G$ are set-functions with Lovasz extensions $f$ and $g$, then
$f + g$ is the Lovasz extension of $F + G$, and for all $\lambda \in \mathbb{R}$, $\lambda f$ is the
Lovasz extension of $\lambda F$,

(b) for $w \in \mathbb{R}^p_+$, $f(w) = \int_0^{+\infty} F(\{w \geq z\}) \, dz$,

(c) if $F(V) = 0$, for all $w \in \mathbb{R}^p$, $f(w) = \int_{-\infty}^{+\infty} F(\{w \geq z\}) \, dz$,

(d) for all $w \in \mathbb{R}^p$ and $\alpha \in \mathbb{R}$, $f(w + \alpha 1_V) = f(w) + \alpha F(V)$,

(e) the Lovasz extension $f$ is positively homogeneous,

(f) for all $A \subseteq V$, $F(A) = f(1_A)$,

(g) if $F$ is symmetric (i.e., $\forall A \subseteq V$, $F(A) = F(V \backslash A)$), then $f$ is even,

(h) if $V = A_1 \cup \cdots \cup A_m$ is a partition of $V$, and $w = \sum_{i=1}^{m} v_i 1_{A_i}$
(i.e., $w$ is constant on each set $A_i$), with $v_1 \geq \cdots \geq v_m$, then
$f(w) = \sum_{i=1}^{m-1} (v_i - v_{i+1}) F(A_1 \cup \cdots \cup A_i) + v_m F(V)$.

Proof. Properties (a), (b) and (c) are immediate from Eq. (2.4) and
Eq. (2.2). Properties (d), (e) and (f) are straightforward from Eq. (2.2).
If $F$ is symmetric, then $F(V) = F(\emptyset) = 0$, and thus
$f(-w) = \int_{-\infty}^{+\infty} F(\{-w \geq z\}) \, dz = \int_{-\infty}^{+\infty} F(\{w \leq -z\}) \, dz = \int_{-\infty}^{+\infty} F(\{w \leq z\}) \, dz = \int_{-\infty}^{+\infty} F(\{w > z\}) \, dz = f(w)$
(because we may replace strict inequalities
by regular inequalities), i.e., $f$ is even. Finally, property (h) is a direct
consequence of Eq. (2.3). □

Fig. 2.1: Lovasz extension for $V = \{1, 2\}$: the function is piecewise
affine, with different slopes for $w_1 \geq w_2$, with values
$F(\{1\}) w_1 + [F(\{1,2\}) - F(\{1\})] w_2$, and for $w_1 \leq w_2$, with values
$F(\{2\}) w_2 + [F(\{1,2\}) - F(\{2\})] w_1$. The level set $\{w \in \mathbb{R}^2, f(w) = 1\}$ is displayed
in blue, together with points of the form $\frac{1}{F(A)} 1_A$.

Note that when the function is a cut function (see Section 3.2), then
the Lovasz extension is related to the total variation and property (c) is
often referred to as the co-area formula (see [21] and references therein,
as well as Section 3.2).



Decomposition into modular plus non-negative function.

Given any submodular function $G$ and an element $t$ of the base
polyhedron $B(G)$ defined in Def. 1.2, the function $F = G - t$ is also
submodular, and is such that $F$ is always non-negative and $F(V) = 0$.
Thus $G$ may be (non-uniquely) decomposed as the sum of a modular
function $t$ and a submodular function $F$ which is always non-negative
and such that $F(V) = 0$. Such functions $F$ have interesting Lovasz
extensions. Indeed, for all $w \in \mathbb{R}^p$, $f(w) \geq 0$ and $f(w + \alpha 1_V) = f(w)$.
Thus, in order to represent the level set $\{f(w) = 1\}$, we only need to
project onto a subspace orthogonal to $1_V$. In Figure 2.2, we consider a
function $F$ which is symmetric (which implies that $F(V) = 0$ and $F$ is
non-negative; see more details in Section 7.4).

Fig. 2.2: Top: Polyhedral level set of $f$ (projected on the set $w^\top 1_V = 0$),
for 2 different submodular symmetric functions of three variables, with
different inseparable sets leading to different sets of extreme points;
changing values of $F$ may make some of the extreme points disappear
(see Section 4.2 for a discussion of inseparable sets and faces of
this polytope). The various extreme points cut the space into polygons
where the ordering of the components is fixed. Left: $F(A) = 1_{|A| \in \{1,2\}}$,
leading to $f(w) = \max_{k \in \{1,2,3\}} w_k - \min_{k \in \{1,2,3\}} w_k$ (all possible
extreme points); note that the polygon need not be symmetric in general.
Right: one-dimensional total variation on three nodes, i.e.,
$F(A) = |1_{1 \in A} - 1_{2 \in A}| + |1_{2 \in A} - 1_{3 \in A}|$, leading to $f(w) = |w_1 - w_2| + |w_2 - w_3|$,
for which the extreme points corresponding to the separable set $\{1, 3\}$
and its complement disappear.

2.2 Greedy algorithm 

The next result relates the Lovasz extension with the support function
of the submodular polyhedron $P(F)$, which is defined in Def. 1.2. This
is the basis for many of the theoretical results and algorithms related to 
submodular functions. It shows that maximizing a linear function with 
non-negative coefficients on the submodular polyhedron may be ob- 
tained in closed form, by the so-called "greedy algorithm" (see [92l Il2] 



(Footnote: The support function is obtained by maximizing linear functions; see
definition in Appendix A.)



and Section 3.8 for an intuitive explanation of this denomination in the
context of matroids), and the optimal value is equal to the value $f(w)$
of the Lovasz extension. Note that otherwise, solving a linear programming
problem with $2^p - 1$ constraints would then be required. This
applies to the submodular polyhedron $P(F)$ and to the base polyhedron
$B(F)$; note the different assumption regarding the positivity of
the components of $w$.



Proposition 2.2. (Greedy algorithm for submodular and base
polyhedra) Let $F$ be a submodular function such that $F(\emptyset) = 0$.
Let $w \in \mathbb{R}^p$, with components ordered in decreasing order, i.e.,
$w_{j_1} \geq \cdots \geq w_{j_p}$, and define $s_{j_k} = F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\})$. Then
$s \in B(F)$ and,

(a) if $w \in \mathbb{R}^p_+$, $s$ is a maximizer of $\max_{s \in P(F)} w^\top s$, and
$\max_{s \in P(F)} w^\top s = f(w)$,

(b) $s$ is a maximizer of $\max_{s \in B(F)} w^\top s$, and $\max_{s \in B(F)} w^\top s = f(w)$.



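Proof. By convex duality (which applies because $P(F)$ has non-empty
interior from Prop. 1.3), we have, by introducing Lagrange multipliers
$\lambda_A \in \mathbb{R}_+$ for the constraints $s(A) \leq F(A)$, $A \subseteq V$, the following pair
of convex optimization problems dual to each other:

$$ \max_{s \in P(F)} w^\top s = \min_{\lambda_A \geq 0, A \subseteq V} \ \max_{s \in \mathbb{R}^p} \Big\{ w^\top s - \sum_{A \subseteq V} \lambda_A [s(A) - F(A)] \Big\} = \min_{\lambda_A \geq 0, A \subseteq V} \ \sum_{A \subseteq V} \lambda_A F(A) \ \text{ such that } \forall k \in V, \ w_k = \sum_{A \ni k} \lambda_A. \tag{2.5} $$

If we take the (primal) candidate solution $s$ obtained from the greedy
algorithm, we have $f(w) = w^\top s$ from Eq. (2.1). We now show that
$s$ is feasible (i.e., in $P(F)$), as a consequence of the submodularity of
$F$. Indeed, without loss of generality, we assume that $j_k = k$ for all
$k \in \{1, \dots, p\}$. We can decompose any subset $A$ of $\{1, \dots, p\}$ as
$A = A_1 \cup \cdots \cup A_m$, where $A_k = (u_k, v_k]$ are integer intervals. We then have:

$$ s(A) = \sum_{k=1}^{m} s(A_k) \quad \text{by modularity} $$
$$ \quad = \sum_{k=1}^{m} \big\{ F((0, v_k]) - F((0, u_k]) \big\} $$
$$ \quad \leq \sum_{k=1}^{m} \big\{ F((u_1, v_k]) - F((u_1, u_k]) \big\} \quad \text{by submodularity} $$
$$ \quad = F((u_1, v_1]) + \sum_{k=2}^{m} \big\{ F((u_1, v_k]) - F((u_1, u_k]) \big\} $$
$$ \quad \leq F((u_1, v_1]) + \sum_{k=2}^{m} \big\{ F((u_1, v_1] \cup (u_2, v_k]) - F((u_1, v_1] \cup (u_2, u_k]) \big\} \quad \text{by submodularity} $$
$$ \quad = F((u_1, v_1] \cup (u_2, v_2]) + \sum_{k=3}^{m} \big\{ F((u_1, v_1] \cup (u_2, v_k]) - F((u_1, v_1] \cup (u_2, u_k]) \big\}. $$

By repeatedly applying submodularity, we finally obtain that
$s(A) \leq F((u_1, v_1] \cup \cdots \cup (u_m, v_m]) = F(A)$, i.e., $s \in P(F)$.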

Moreover, we can define dual variables $\lambda_{\{j_1, \dots, j_k\}} = w_{j_k} - w_{j_{k+1}}$ for
$k \in \{1, \dots, p-1\}$ and $\lambda_V = w_{j_p}$, with all other $\lambda_A$ equal to zero. Then
they are all non-negative (notably because $w \geq 0$), and satisfy the
constraint $\forall k \in V$, $w_k = \sum_{A \ni k} \lambda_A$. Finally, the dual cost function has
also value $f(w)$ (from Eq. (2.2)). Thus by duality (which holds, because
$P(F)$ has a non-empty interior), $s$ is an optimal solution. Note that it
is not unique (see Section 4 for a description of the set of solutions).

In order to show (b), we may first assume that $w \geq 0$; we may then
replace $P(F)$ by $B(F)$, by simply dropping the constraint $\lambda_V \geq 0$ in
Eq. (2.5). Since the solution obtained by the greedy algorithm satisfies
$s(V) = F(V)$, we get a pair of primal-dual solutions, hence the
optimality.

The result generalizes to all possible $w$, because we may add a large
constant vector to $w$, which does not change the maximization with
respect to $B(F)$ (since it includes the constraint $s(V) = F(V)$). □
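
The greedy algorithm of Prop. 2.2 amounts to one sort and p evaluations of F. The following sketch is an illustration under assumptions, not the authors' code: indices run over 0, ..., p-1, set-functions are callables on frozensets with F(frozenset()) = 0, and the names `greedy_vector`, `greedy_maximizer`, `in_base_polyhedron` and `sqrt_card` are introduced here. The brute-force membership test is exponential in p and is only included to check the output on a tiny example.

```python
from itertools import combinations, permutations


def greedy_vector(F, order):
    """Base vector s with s[j_k] = F({j_1..j_k}) - F({j_1..j_{k-1}}) for a given ordering."""
    s, prefix, previous = [0.0] * len(order), set(), 0.0
    for j in order:
        prefix.add(j)
        current = F(frozenset(prefix))
        s[j] = current - previous
        previous = current
    return s


def greedy_maximizer(F, w):
    """Greedy maximizer of w^T s over B(F): sort w in decreasing order (Prop. 2.2)."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    return greedy_vector(F, order)


def in_base_polyhedron(s, F, tol=1e-9):
    """Brute-force test of s(A) <= F(A) for all non-empty A, and s(V) = F(V)."""
    p = len(s)
    for r in range(1, p + 1):
        for A in combinations(range(p), r):
            if sum(s[k] for k in A) > F(frozenset(A)) + tol:
                return False
    return abs(sum(s) - F(frozenset(range(p)))) <= tol


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sqrt_card(A):
    return len(A) ** 0.5  # submodular example


w = [0.5, 2.0, -1.0]
s = greedy_maximizer(sqrt_card, w)
# Greedy vectors for all orderings also lie in B(F); none of them should beat s.
best_over_orderings = max(dot(w, greedy_vector(sqrt_card, perm))
                          for perm in permutations(range(len(w))))
```

On this example `s` lies in B(F) and `dot(w, s)` matches `best_over_orderings`, as part (b) predicts; note that w has a negative component, which is allowed for B(F) but not for P(F).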
The next proposition draws precise links between convexity and
submodularity, by showing that a set-function $F$ is submodular if and
only if its Lovasz extension $f$ is convex [92]. This is further developed
in Prop. 2.4, where it is shown that, when $F$ is submodular, minimizing
$F$ on $2^V$ (which is equivalent to minimizing $f$ on $\{0, 1\}^p$ since $f$ is an
extension of $F$) and minimizing $f$ on $[0, 1]^p$ are equivalent.



Proposition 2.3. (Convexity and submodularity) A set-function
$F$ is submodular if and only if its Lovasz extension $f$ is convex.



Proof. Let $A, B \subseteq V$. The vector $1_{A \cup B} + 1_{A \cap B} = 1_A + 1_B$ has
components equal to 0 (on $V \backslash (A \cup B)$), 2 (on $A \cap B$) and 1 (on
$A \Delta B = (A \backslash B) \cup (B \backslash A)$). Therefore,
$f(1_{A \cup B} + 1_{A \cap B}) = \int_0^2 F(\{1_A + 1_B \geq z\}) \, dz = \int_0^1 F(A \cup B) \, dz + \int_1^2 F(A \cap B) \, dz = F(A \cup B) + F(A \cap B)$.

If $f$ is convex, then by homogeneity, $f(1_A + 1_B) \leq f(1_A) + f(1_B)$,
which is equal to $F(A) + F(B)$, and thus $F$ is submodular.

If $F$ is submodular, then by Prop. 2.2, for all $w \in \mathbb{R}^p_+$, $f(w)$ is
a maximum of linear functions, thus it is convex on $\mathbb{R}^p_+$. Moreover,
because $f(w + \alpha 1_V) = f(w) + \alpha F(V)$, it is convex on $\mathbb{R}^p$. □

The next proposition completes Prop. 2.3 by showing that minimizing
the Lovasz extension on $[0, 1]^p$ is equivalent to minimizing it on
$\{0, 1\}^p$, and hence to minimizing the set-function $F$ on $2^V$ (when $F$ is
submodular).



Proposition 2.4. (Minimization of submodular functions)
Let $F$ be a submodular function and $f$ its Lovasz extension; then
$\min_{A \subseteq V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$.



Proof. Because $f$ is an extension from $\{0, 1\}^p$ to $[0, 1]^p$ (property (f)
from Prop. 2.1), we must have
$\min_{A \subseteq V} F(A) = \min_{w \in \{0,1\}^p} f(w) \geq \min_{w \in [0,1]^p} f(w)$.
For the other inequality, any $w \in [0, 1]^p$ may be
decomposed as $w = \sum_{i=1}^{p} \lambda_i 1_{B_i}$ where $B_1 \subseteq \cdots \subseteq B_p = V$, where $\lambda$ is
nonnegative and has a sum smaller than or equal to one (this can be
obtained by considering $B_i$ the set of indices of the $i$ largest values of
$w$). We then have, from Eq. (2.2),
$f(w) = \sum_{i=1}^{p} \lambda_i F(B_i) \geq \sum_{i=1}^{p} \lambda_i \min_{A \subseteq V} F(A) \geq \min_{A \subseteq V} F(A)$
(because $\min_{A \subseteq V} F(A) \leq 0$).
This leads to the desired result.

Note that the last equality shows that the minimizers of $f(w)$ on
$w \in [0, 1]^p$ must have sup-level sets (i.e., the sets $B_i$ defined above)
which are minimizers of $F$ (i.e., $w$ is a convex combination of indicator
vectors of minimizers of $F$). □

We end this section by simply stating the greedy algorithm for
the symmetric independence polyhedron, whose proof is similar to the
proof of Prop. 2.2 (we define the sign of $a$ as $+1$ if $a > 0$, $-1$
if $a < 0$, and zero otherwise; $|w|$ denotes the vector composed of the
absolute values of the components of $w$).



Proposition 2.5. (Greedy algorithm for symmetric independence
polyhedron) Let $F$ be a submodular function such that
$F(\emptyset) = 0$ and $F$ is non-decreasing. Let $w \in \mathbb{R}^p$. A maximizer of
$\max_{s \in |P|(F)} w^\top s$ may be obtained by the following algorithm: order
the components of $|w|$ as $|w_{j_1}| \geq \cdots \geq |w_{j_p}|$ and define
$s_{j_k} = \mathrm{sign}(w_{j_k}) [F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\})]$. Moreover, for all $w \in \mathbb{R}^p$,
$\max_{s \in |P|(F)} w^\top s = f(|w|)$.
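
The algorithm of Prop. 2.5 differs from the greedy algorithm for B(F) only in that the ordering uses absolute values and the marginal gains are multiplied by signs. A minimal sketch follows, assuming indices 0, ..., p-1 and set-functions given as callables on frozensets; `sign`, `greedy_symmetric`, `lovasz_extension` and `sqrt_card` are illustrative names, not from the original text.

```python
def sign(a):
    """+1 if a > 0, -1 if a < 0, and 0 otherwise."""
    return (a > 0) - (a < 0)


def lovasz_extension(F, w):
    """Lovasz extension via Eq. (2.1)."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, previous, prefix = 0.0, 0.0, set()
    for j in order:
        prefix.add(j)
        current = F(frozenset(prefix))
        value += w[j] * (current - previous)
        previous = current
    return value


def greedy_symmetric(F, w):
    """Greedy maximizer of w^T s over |P|(F): order |w| decreasingly, signed marginal gains."""
    order = sorted(range(len(w)), key=lambda k: -abs(w[k]))
    s, prefix, previous = [0.0] * len(w), set(), 0.0
    for j in order:
        prefix.add(j)
        current = F(frozenset(prefix))
        s[j] = sign(w[j]) * (current - previous)
        previous = current
    return s


def sqrt_card(A):
    return len(A) ** 0.5  # non-decreasing and submodular


w = [-2.0, 0.5, 1.0]
s = greedy_symmetric(sqrt_card, w)
objective = sum(wk * sk for wk, sk in zip(w, s))
norm_value = lovasz_extension(sqrt_card, [abs(wk) for wk in w])
```

On this example `objective` and `norm_value` agree, which is the identity stated at the end of the proposition.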



2.3 Structured sparsity and convex relaxations 

Structured sparsity. The concept of parsimony is central in many 
scientific domains. In the context of statistics, signal processing or ma- 
chine learning, it takes the form of variable or feature selection prob- 
lems. 

In a supervised learning problem, we aim to predict $n$ responses
$y_i \in \mathbb{R}$, from $n$ observations $x_i \in \mathbb{R}^p$, for $i \in \{1, \dots, n\}$. In this paper, we
focus on linear predictors of the form $f(x) = w^\top x$, where $w \in \mathbb{R}^p$ (for
extensions to non- linear predictions, see [HE] and references therein). 
We consider estimators obtained by the following regularized empirical
risk minimization formulation:

$$ \min_{w \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \Omega(w), \tag{2.6} $$

where $\ell(y, \hat{y})$ is a loss between a prediction $\hat{y}$ and the true response $y$,
and $\Omega$ is a norm. Typically, the quadratic loss $\ell(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2$ is used
for regression problems and the logistic loss $\ell(y, \hat{y}) = \log(1 + \exp(-y \hat{y}))$
is used for binary classification problems where $y \in \{-1, 1\}$ (see, e.g.,
[128] and [58] for more complete descriptions of loss functions).

In order to promote sparsity, the $\ell_1$-norm is commonly used and, in
a least-squares regression framework is referred to as the Lasso [ISTj in 
statistics and as basis pursuit [24] in signal processing. Sparse models
are commonly used in two situations: First, to make the model or the 
prediction more interpretable or cheaper to use, i.e., even if the under- 
lying problem does not admit sparse solutions, one looks for the best 
sparse approximation. Second, sparsity can also be used given prior 
knowledge that the model should be sparse. In these two situations, re- 
ducing parsimony to finding models with low cardinality turns out to be 
limiting, and structured parsimony has emerged as a fruitful practical 
extension, with applications to image processing, text processing, bioin- 
formatics or audio processing (see, e.g., plOl [Til [68l ITT] [82] ITHl [95| [90] , 
a review in [SllH] and Section [3] for various examples, and in particular 
Section 3.3 for relationships with grouped $\ell_1$-norms with overlapping
groups).

Convex relaxation of combinatorial penalty. Most of the work 
based on convex optimization and the design of dedicated sparsity- 
inducing norms has focused mainly on the specific allowed set of spar- 
sity patterns |1401 [JH [TH [76] : if w G RP denotes the predictor we aim 
to estimate, and $\mathrm{Supp}(w)$ denotes its support, then these norms are
designed so that penalizing with these norms only leads to supports from
a given family of allowed patterns. We can instead follow the approach
of [59, 68] and consider specific penalty functions $F(\mathrm{Supp}(w))$ of the
support set $\mathrm{Supp}(w) = \{j \in V, w_j \neq 0\}$, which go beyond the
cardinality function, but are not limited or designed to only forbid certain
sparsity patterns. As first shown in [6], for non-decreasing submodular
functions, these may also lead to restricted sets of supports but their
interpretation in terms of an explicit penalty on the support leads to
additional insights into the behavior of structured sparsity-inducing
norms.

While direct greedy approaches (i.e., forward selection) to the
problem are considered in [59, 68], submodular analysis may be brought to
bear to provide convex relaxations to the function $w \mapsto F(\mathrm{Supp}(w))$,
which extend the traditional link between the $\ell_1$-norm and the
cardinality function.



Proposition 2.6. (Convex relaxation of functions defined
through supports) Let $F$ be a non-decreasing submodular function.
The function $w \mapsto f(|w|)$ is the convex envelope (tightest convex lower
bound) of the function $w \mapsto F(\mathrm{Supp}(w))$ on the unit $\ell_\infty$-ball $[-1, 1]^p$.



Proof. We use the notation $|w|$ to denote the $p$-dimensional vector
composed of the absolute values of the components of $w$. We denote
by $g^*$ the Fenchel conjugate (see definition in Appendix A) of
$g : w \mapsto F(\mathrm{Supp}(w))$ on the domain $\{w \in \mathbb{R}^p, \ \|w\|_\infty \leqslant 1\} = [-1, 1]^p$,
and by $g^{**}$ its bidual [E]. We only need to show that the Fenchel bidual
is equal to the function $w \mapsto f(|w|)$. By definition of the Fenchel
conjugate, we have:

\begin{align*}
g^*(s) &= \max_{\|w\|_\infty \leqslant 1} w^\top s - g(w) \\
&= \max_{\delta \in \{0,1\}^p} \max_{\|w\|_\infty \leqslant 1} (\delta \circ w)^\top s - f(\delta) && \text{by definition of } g, \\
&= \max_{\delta \in \{0,1\}^p} \delta^\top |s| - f(\delta) && \text{by maximizing out } w, \\
&= \max_{\delta \in [0,1]^p} \delta^\top |s| - f(\delta) && \text{because } F - |s| \text{ is submodular.}
\end{align*}

Thus, for all $w$ such that $\|w\|_\infty \leqslant 1$,

\begin{align*}
g^{**}(w) &= \max_{s \in \mathbb{R}^p} s^\top w - g^*(s) \\
&= \max_{s \in \mathbb{R}^p} \min_{\delta \in [0,1]^p} s^\top w - \delta^\top |s| + f(\delta).
\end{align*}

By strong convex duality (which applies because Slater's condition [17]
is satisfied), we get:

\begin{align*}
g^{**}(w) &= \min_{\delta \in [0,1]^p} \max_{s \in \mathbb{R}^p} s^\top w - \delta^\top |s| + f(\delta) && \text{by strong duality,} \\
&= \min_{\delta \in [0,1]^p, \ \delta \geqslant |w|} f(\delta) = f(|w|) && \text{because } F \text{ is non-decreasing,}
\end{align*}

which leads to the desired result. Note that $F$ non-decreasing implies
that $f$ is non-decreasing with respect to all of its components. □
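The statement of Proposition 2.6 can be checked numerically on a small example. The sketch below is only an illustration under assumptions of ours: the function $F(A) = |A|^{1/2}$, the dimension $p = 4$, and the helper names are hypothetical choices, and the Lovász extension is computed with the usual greedy formula. It checks that $f(|w|)$ never exceeds $F(\mathrm{Supp}(w))$ on sampled points of the unit $\ell_\infty$-ball, and that the two coincide for $w \in \{-1, 0, 1\}^p$.

```python
import itertools
import random


def lovasz_extension(F, w):
    """Greedy formula: sort w decreasingly, weight each component by its marginal gain."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, prefix, previous = 0.0, set(), F(frozenset())
    for k in order:
        prefix.add(k)
        current = F(frozenset(prefix))
        value += w[k] * (current - previous)
        previous = current
    return value


def F(A):
    # Assumed example of a non-decreasing submodular function: sqrt of the cardinality.
    return len(A) ** 0.5


def penalty(w):
    # Combinatorial penalty F(Supp(w)).
    return F(frozenset(k for k, x in enumerate(w) if x != 0))


def relaxation(w):
    # Candidate convex envelope f(|w|).
    return lovasz_extension(F, [abs(x) for x in w])


random.seed(0)
p = 4
samples = [[random.choice([0.0, random.uniform(-1.0, 1.0)]) for _ in range(p)]
           for _ in range(1000)]

# Lower bound on the unit l_infinity ball: f(|w|) - F(Supp(w)) should never be positive.
max_gap = max(relaxation(w) - penalty(w) for w in samples)

# Tightness at sign vectors: both functions agree on {-1, 0, 1}^p.
tight = all(abs(relaxation(w) - penalty(w)) < 1e-12
            for w in itertools.product([-1, 0, 1], repeat=p))

print(max_gap <= 1e-12, tight)
```

Such a check does not prove that the envelope is the tightest one; it only illustrates the two properties (lower bound, agreement at sign vectors) that the proof relies on.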

The previous proposition provides a relationship between combinatorial
optimization problems (involving functions of the form $w \mapsto
F(\mathrm{Supp}(w))$) and convex optimization problems involving the Lovász
extension. A desirable behavior of a convex relaxation is that some of
the properties of the original problem are preserved. In this paper, we
will focus mostly on the allowed set of sparsity patterns (see below and
Section 4.3). For more details about theoretical guarantees and applications
of submodular functions to structured sparsity, see [7]. In Section 3,
we consider several examples of submodular functions and present,
when appropriate, how they translate to sparsity-inducing norms.

Optimization for regularized risk minimization. Given the
representation of $\Omega$ as the maximum of linear functions (Prop. 12. Sh), we
can easily obtain a subgradient of $\Omega$, thus allowing the use of subgradient
descent techniques (see a description in Appendix A.2). However,
these methods typically require many iterations, and given the structure
of our norms, more efficient methods are available: we describe in
Section ISTTl proximal methods, which generalize soft-thresholding
algorithms for the $\ell_1$-norm and grouped $\ell_1$-norm, and can use efficiently
the combinatorial structure of the norms.

Structured sparsity-inducing norms and dual balls. We assume
in this paragraph that $F$ is submodular and non-decreasing, and
such that its values on all singletons are strictly positive. The function
$\Omega : w \mapsto f(|w|)$ is then a norm [6]. Through the representation
$\Omega(w) = \max_{s \in P(F)} |w|^\top s = \max_{|s| \in P(F)} w^\top s = \max_{s \in |P|(F)} w^\top s$,
the unit dual ball is the symmetric independence polyhedron

\[ |P|(F) = \{ s \in \mathbb{R}^p, \ |s| \in P(F) \} = \{ s \in \mathbb{R}^p, \ \forall A \subset V, \ \|s_A\|_1 \leqslant F(A) \} \]

(see Appendix A for more details on polar sets and dual norms).

The dual ball $|P|(F) = \{ s \in \mathbb{R}^p, \ \Omega^*(s) \leqslant 1 \}$ is naturally characterized
by half planes of the form $w^\top s \leqslant F(\mathrm{Supp}(w))$ for $w \in \{-1, 0, 1\}^p$. Thus,
the unit ball of $\Omega$ is the convex hull of the vectors $\frac{1}{F(\mathrm{Supp}(w))} w$ for the
same vectors $w \in \{-1, 0, 1\}^p$. See Figure 2.3 for examples for $p = 2$
and Figure 2.4 for examples with $p = 3$.

A particular feature of the unit ball of $\Omega$ is that it has faces which
are composed of vectors with many zeros, leading to structured sparsity
(see Section 3 for examples and Section ll?2| for more details about the
facial structure of the symmetric independence polyhedron). However,
as can be seen in Figures 2.3 and 2.4, there are additional extreme
points and faces where many of the components of $|w|$ are equal (e.g.,
the corners of the $\ell_\infty$-ball). In the context of sparsity-inducing norms,
this has the sometimes undesirable effect of inducing vectors with many
components of equal magnitude. As shown in [115], this effect, due to
the $\ell_\infty$-norm in Prop. 2.6, may be corrected by the appropriate use of
$\ell_q$-norms, $q \in (1, \infty)$, which we now present for the $\ell_2$-norm.

$\ell_2$-relaxations of submodular penalties. Given a non-decreasing
submodular function $F$ such that $F(\{k\}) > 0$ for all $k \in V$, we may define
a norm $\Theta$ as follows:

\[ \Theta(w) = \min_{\eta \in \mathbb{R}_+^p} \ \frac{1}{2} \sum_{i \in V} \frac{w_i^2}{\eta_i} + \frac{1}{2} f(\eta), \]

using the usual convention that $w_i^2 / \eta_i$ is equal to zero as soon as
$w_i = 0$, and equal to $+\infty$ if $w_i \neq 0$ and $\eta_i = 0$ (for more details on
variational representations of any norms through squared $\ell_2$-norms, see [8]).
As shown in [115], this defines a norm, which shares the same
sparsity-inducing effects as $\Omega$, without the extra singular points. Moreover, the
optimization results presented in this paper can be used as well to derive
efficient algorithms for optimization problems regularized by this
norm. Finally, Prop. 2.6 may be extended, and $\Theta$ is the convex envelope
of the function $w \mapsto F(\mathrm{Supp}(w))^{1/2} \|w\|_2$, or the homogeneous convex
envelope (the tightest homogeneous convex lower bound) of the function
$w \mapsto \frac{1}{2} F(\mathrm{Supp}(w)) + \frac{1}{2} \|w\|_2^2$, thus replacing the $\ell_\infty$-constraint by
an $\ell_2$-penalty.

Fig. 2.3: Polyhedral unit ball of $\Omega$ (top) with the associated dual unit
ball (bottom), for 4 different submodular functions (two variables),
with different sets of extreme points; changing values of $F$ may make
some of the extreme points disappear (see the notion of stable sets in
Section 4.3). From left to right: $F(A) = |A|^{1/2}$ (all possible extreme
points), $F(A) = |A|$ (leading to the $\ell_1$-norm), $F(A) = \min\{|A|, 1\}$
(leading to the $\ell_\infty$-norm), $F(A) = \frac{1}{2} 1_{A \cap \{2\} \neq \varnothing} + 1_{A \neq \varnothing}$ (leading to
the structured norm $\Omega(w) = \frac{1}{2} |w_2| + \|w\|_\infty$). Extreme points of the
primal balls correspond to full-dimensional faces of the dual ball, and
vice-versa.

Shaping level sets through symmetric submodular functions.
For a non-decreasing submodular function $F$, we have defined a norm
$\Omega(w) = f(|w|)$, that essentially allows the definition of a prior knowledge
on supports of predictors $w$. When using the Lovász extension
directly for symmetric submodular functions, then it turns out that
the effect is on all sub-level sets $\{w \leqslant \alpha\}$ and not only on the support
$\{w \neq 0\}$. Indeed, as shown in [7], the Lovász extension is the
convex envelope of the function $w \mapsto \max_{\alpha \in \mathbb{R}} F(\{w \leqslant \alpha\})$ on the set
$[0, 1]^p + \mathbb{R} 1_V = \{ w \in \mathbb{R}^p, \ \max_{k \in V} w_k - \min_{k \in V} w_k \leqslant 1 \}$.

Fig. 2.4: Unit balls for structured sparsity-inducing norms, with the
corresponding submodular functions and the associated norm. Panels:
a function $F$ with all possible extreme points;
$F(A) = 1_{A \cap \{1\} \neq \varnothing} + 1_{A \cap \{2,3\} \neq \varnothing}$, with $\Omega(w) = |w_1| + \|w_{\{2,3\}}\|_\infty$;
$F(A) = 1_{A \cap \{1,2,3\} \neq \varnothing} + 1_{A \cap \{2,3\} \neq \varnothing} + 1_{A \cap \{3\} \neq \varnothing}$, with
$\Omega(w) = \|w\|_\infty + \|w_{\{2,3\}}\|_\infty + |w_3|$.

The main examples of such symmetric functions are cuts in undirected
graphs, which we describe in Section 3.2, leading to the total
variation, but other examples are interesting as well for machine
learning (see [7]). Finally, while the facial structure of the symmetric
independence polyhedron |P|(F) was key to analysing the regulariza- 
tion properties for shaping supports, the base polyhedron B{F) is the 
proper polyhedron (see Section for more details). 



3 Examples and applications of submodular functions



We now present classical examples of submodular functions. For each of
these, we also describe the corresponding Lovász extensions, and, when
appropriate, the associated submodular polyhedra. We also present
applications to machine learning, either through formulations as
combinatorial optimization problems or through the regularization properties
of the Lovász extension. We are by no means exhaustive and other
applications may be found in facility location [3ll[30l[I], game theory [H],
document summarization [9T], social networks [HI], or clustering [107].

Note that in Appendix B.2, we present several operations that preserve
submodularity (such as symmetrization and partial minimization),
which can be applied to any of the functions presented in this
section, thus defining new functions.

3.1 Cardinality-based functions 

We consider functions that depend only on $s(A)$ for a certain $s \in \mathbb{R}_+^p$.
If $s = 1_V$, these are functions of the cardinality. The next proposition
shows that only concave functions lead to submodular functions,
which is coherent with the diminishing return property from


Section d] (Prop. [H]). 



Proposition 3.1. (Submodularity of cardinality-based set-functions)
If $s \in \mathbb{R}_+^p$ and $g : \mathbb{R}_+ \to \mathbb{R}$ is a concave function, then
$F : A \mapsto g(s(A))$ is submodular. If $F : A \mapsto g(s(A))$ is submodular for
all $s \in \mathbb{R}_+^p$, then $g$ is concave.



Proof. The function $F : A \mapsto g(s(A))$ is submodular if and only if for
all $A \subset V$ and $j, k \in V \setminus A$: $g(s(A) + s_k) - g(s(A)) \geqslant g(s(A) + s_k +
s_j) - g(s(A) + s_j)$. If $g$ is concave and $a \geqslant 0$, $t \mapsto g(a + t) - g(t)$ is
non-increasing, hence the first result. Moreover, if $t \mapsto g(a + t) - g(t)$ is
non-increasing for all $a \geqslant 0$, then $g$ is concave, hence the second result.

□
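For small base sets, the first part of Proposition 3.1 can be illustrated by brute force. The following sketch enumerates all pairs of subsets and tests the submodularity inequality directly; the weights `s` and the two choices of $g$ (a concave square root and a convex square) are hypothetical examples of ours, not taken from the text.

```python
import itertools
import math


def is_submodular(F, p, tol=1e-12):
    """Check F(A) + F(B) >= F(A | B) + F(A & B) for all subsets A, B of {0, ..., p-1}."""
    subsets = [frozenset(c) for r in range(p + 1)
               for c in itertools.combinations(range(p), r)]
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - tol
               for A in subsets for B in subsets)


s = [0.5, 1.0, 2.0, 3.5]  # hypothetical non-negative weights


def weight(A):
    # Modular function s(A).
    return sum(s[k] for k in A)


def concave_of_weight(A):
    # g(t) = sqrt(t) is concave on the non-negative reals.
    return math.sqrt(weight(A))


def convex_of_weight(A):
    # g(t) = t^2 is convex, so submodularity is not expected.
    return weight(A) ** 2


concave_ok = is_submodular(concave_of_weight, len(s))
convex_ok = is_submodular(convex_of_weight, len(s))
print(concave_ok, convex_ok)
```

The enumeration costs $4^p$ function evaluations, so it is only meant as a sanity check for very small $p$.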



Proposition 3.2. (Lovász extension of cardinality-based set-functions)
Let $s \in \mathbb{R}_+^p$ and $g : \mathbb{R}_+ \to \mathbb{R}$ be a concave function
such that $g(0) = 0$. The Lovász extension of the submodular function
$F : A \mapsto g(s(A))$ is equal to

\[ f(w) = \sum_{k=1}^p w_{j_k} \big[ g(s_{j_1} + \cdots + s_{j_k}) - g(s_{j_1} + \cdots + s_{j_{k-1}}) \big], \]

where the components of $w$ are ordered as $w_{j_1} \geqslant \cdots \geqslant w_{j_p}$.
If $s = 1_V$, i.e., $F(A) = g(|A|)$, then $f(w) = \sum_{k=1}^p w_{j_k} [g(k) - g(k - 1)]$.

Thus, for functions of the cardinality (for which $s = 1_V$), the Lovász
extension is a linear combination of order statistics (i.e., the $r$-th largest
component of $w$, for $r \in \{1, \dots, p\}$).
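As a minimal sketch of the cardinality case $s = 1_V$, the code below compares the generic greedy computation of the Lovász extension with the order-statistics formula $\sum_k w_{j_k} [g(k) - g(k-1)]$. The choice $g(t) = \sqrt{t}$, the random vector, and the helper names are assumptions made for the illustration.

```python
import math
import random


def lovasz_extension(F, w):
    """Greedy formula: sort w decreasingly, weight each component by its marginal gain."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, prefix, previous = 0.0, set(), F(frozenset())
    for k in order:
        prefix.add(k)
        current = F(frozenset(prefix))
        value += w[k] * (current - previous)
        previous = current
    return value


def g(t):
    # Assumed concave function with g(0) = 0.
    return math.sqrt(t)


def F(A):
    # Function of the cardinality, F(A) = g(|A|).
    return g(len(A))


def order_statistics_form(w):
    # Linear combination of the sorted components of w with weights g(k) - g(k-1).
    ranked = sorted(w, reverse=True)
    return sum(ranked[k - 1] * (g(k) - g(k - 1)) for k in range(1, len(w) + 1))


random.seed(1)
w = [random.gauss(0.0, 1.0) for _ in range(6)]
gap = abs(lovasz_extension(F, w) - order_statistics_form(w))
print(gap < 1e-12)
```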

Application to machine learning. In terms of set functions,
considering $g(s(A))$ instead of $s(A)$ does not make a significant difference.
However, it does in terms of the Lovász extension. Indeed, as shown
in [7], using the Lovász extension for regularization encourages
components of $w$ to be equal (see also Section 2.3), and hence provides a
convex prior for clustering or outlier detection, depending on the choice
of the concave function $g$ (see more details in [71[6l]). This is a situation
where this effect has positive desired consequences.

Some special cases of non-decreasing functions are of interest, such
as $F(A) = |A|$, for which $f(w) = w^\top 1_V$ and $\Omega$ is the $\ell_1$-norm, and
$F(A) = 1_{|A| > 0}$, for which $f(w) = \max_{k \in V} w_k$ and $\Omega$ is the $\ell_\infty$-norm.
When restricted to subsets of $V$ and then linearly combined, we obtain
set covers defined in Section 3.3. Other interesting examples of
combinations of functions of restricted weighted cardinality functions may
be found in [mHIHH].

3.2 Cut functions 

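Given a set of (not necessarily symmetric) weights $d : V \times V \to \mathbb{R}_+$,
define the cut as

\[ F(A) = \sum_{k \in A, \ j \in V \setminus A} d(k, j), \]

which we denote $d(A, V \setminus A)$. Note that for a cut function and disjoint
subsets $A, B, C$, we always have (see [35] for more details):

\begin{align*}
F(A \cup B \cup C) &= F(A \cup B) + F(A \cup C) + F(B \cup C) \\
&\quad - F(A) - F(B) - F(C) + F(\varnothing), \\
F(A \cup B) &= d(A \cup B, (A \cup B)^c) = d(A, A^c \cap B^c) + d(B, A^c \cap B^c) \\
&\leqslant d(A, A^c) + d(B, B^c) = F(A) + F(B),
\end{align*}

where we denote $A^c = V \setminus A$. This implies that $F$ is sub-additive. We
then have, for any sets $A, B \subset V$:

\begin{align*}
F(A \cup B) &= F([A \cap B] \cup [A \setminus B] \cup [B \setminus A]) \\
&= F([A \cap B] \cup [A \setminus B]) + F([A \cap B] \cup [B \setminus A]) + F([A \setminus B] \cup [B \setminus A]) \\
&\quad - F(A \cap B) - F(A \setminus B) - F(B \setminus A) + F(\varnothing) \\
&= F(A) + F(B) + F(A \Delta B) - F(A \cap B) - F(A \setminus B) - F(B \setminus A) \\
&= F(A) + F(B) - F(A \cap B) + [F(A \Delta B) - F(A \setminus B) - F(B \setminus A)] \\
&\leqslant F(A) + F(B) - F(A \cap B), \quad \text{by sub-additivity,}
\end{align*}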

which shows submodularity. Moreover, the Lovász extension is equal
to

\[ f(w) = \sum_{k, j \in V} d(k, j) (w_k - w_j)_+ \]

(which provides an alternative proof of submodularity owing to
Prop. 2.3). Thus, if the weight function $d$ is symmetric, then the
submodular function is also symmetric and the Lovász extension is even
(from Prop. EH]). Examples of graphs related to such cuts (i.e., graphs
defined on $V$ for which there is an edge from $k$ to $j$ if and only if
$d(k, j) > 0$) are shown in Figures 3.1 and 3.2. An interesting instance
of these Lovász extensions plays a crucial role in signal and image
processing; indeed, for a graph composed of a two-dimensional grid with
4-connectivity (see Figure 3.1), we obtain a certain version of the total
variation, which is a common prior to induce piecewise-constant signals
(see applications to machine learning below). In fact, some of the
results presented in this paper were first shown on this particular case
(see, e.g., [21] and references therein).

Fig. 3.1: Two-dimensional grid with 4-connectivity. The cuts in these
undirected graphs lead to Lovász extensions which are certain versions
of total variations, which enforce level sets of $w$ to be connected with
respect to the graph.
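The closed form of the Lovász extension of a cut can be compared with the generic greedy computation on a small random graph. This is a sketch under assumptions: the random (non-symmetric) weights, the dimension $p = 5$, and the function names are ours.

```python
import random


def lovasz_extension(F, w):
    """Greedy formula: sort w decreasingly, weight each component by its marginal gain."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, prefix, previous = 0.0, set(), F(frozenset())
    for k in order:
        prefix.add(k)
        current = F(frozenset(prefix))
        value += w[k] * (current - previous)
        previous = current
    return value


random.seed(2)
p = 5
# Hypothetical directed weights d(k, j) >= 0 with zero diagonal.
d = [[0.0 if k == j else random.random() for j in range(p)] for k in range(p)]


def cut(A):
    # F(A) = sum of d(k, j) over k in A and j outside A.
    return sum(d[k][j] for k in A for j in range(p) if j not in A)


def directed_total_variation(w):
    # Closed form: sum over ordered pairs of d(k, j) * (w_k - w_j)_+.
    return sum(d[k][j] * max(w[k] - w[j], 0.0) for k in range(p) for j in range(p))


w = [random.gauss(0.0, 1.0) for _ in range(p)]
gap = abs(lovasz_extension(cut, w) - directed_total_variation(w))
print(gap < 1e-9)
```

When `d` is symmetric, the same closed form reduces to a weighted sum of absolute differences $|w_k - w_j|$ over edges, i.e., the total variation mentioned above.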

Note that these functions can be extended to cuts in hypergraphs, 
which may have interesting applications in computer vision [18] . More- 
over, directed cuts (i.e., when d{k,j) and d(j, k) may be different) may 
be interesting to favor increasing or decreasing jumps along the edges 
of the graph. Finally, there is another interesting link between directed 
cuts and isotonic regression (see, e.g., [HS] and references therein), which 
corresponds to solving a separable optimization problem regularized by 
a large constant times the associated Lovász extension. See another link
with isotonic regression in Section 5.4.

Interpretation in terms of quadratic functions of indicator
variables. For undirected graphs (i.e., for which the function $d$ is
symmetric), we may rewrite the cut as follows:

\[ F(A) = \frac{1}{2} \sum_{k=1}^p \sum_{j=1}^p d(k, j) \, |(1_A)_k - (1_A)_j| = \frac{1}{2} \sum_{k=1}^p \sum_{j=1}^p d(k, j) \, |(1_A)_k - (1_A)_j|^2, \]

because $|(1_A)_k - (1_A)_j| \in \{0, 1\}$. This leads to

\[ F(A) = \sum_{k=1}^p \sum_{j=1}^p (1_A)_k (1_A)_j \Big[ 1_{j = k} \sum_{i=1}^p d(i, k) - d(j, k) \Big] = 1_A^\top Q 1_A, \]

with $Q = \mathrm{Diag}(D 1) - D$, where $D$ is the square weighted affinity matrix
obtained from $d$, so that $Q$ has non-positive off-diagonal elements ($Q$ is the
Laplacian of the graph [27]). It turns out that a sum of linear and
quadratic functions of $1_A$ is submodular only in this situation.



Proposition 3.3. (Submodularity of quadratic functions) Let
$Q \in \mathbb{R}^{p \times p}$ and $q \in \mathbb{R}^p$. Then the function $F : A \mapsto q^\top 1_A + \frac{1}{2} 1_A^\top Q 1_A$
is submodular if and only if all off-diagonal elements of $Q$ are
non-positive.

Proof. Since cuts are submodular, the previous developments show that
the condition is sufficient. It is necessary by simply considering the
inequality $0 \leqslant F(\{i\}) + F(\{j\}) - F(\{i, j\}) = q_i + \frac{1}{2} Q_{ii} + q_j + \frac{1}{2} Q_{jj} -
[q_i + q_j + \frac{1}{2} Q_{ii} + \frac{1}{2} Q_{jj} + Q_{ij}] = -Q_{ij}$. □
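Both the Laplacian rewriting of undirected cuts and Proposition 3.3 can be checked by enumeration on a small example. In the sketch below, the random symmetric weights, the linear term `q`, and the perturbed matrix `Q_bad` are hypothetical choices of ours; with $Q = \mathrm{Diag}(D1) - D$, the code checks the identity $d(A, V \setminus A) = 1_A^\top Q 1_A$ and then tests submodularity of $q^\top 1_A + \frac{1}{2} 1_A^\top Q 1_A$ before and after making one off-diagonal entry positive.

```python
import itertools
import random

p = 4
subsets = [frozenset(c) for r in range(p + 1)
           for c in itertools.combinations(range(p), r)]


def is_submodular(F, tol=1e-12):
    """Brute-force check of F(A) + F(B) >= F(A | B) + F(A & B)."""
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - tol
               for A in subsets for B in subsets)


def quadratic(q, Q):
    """Return A -> q^T 1_A + (1/2) 1_A^T Q 1_A."""
    def F(A):
        return sum(q[k] for k in A) + 0.5 * sum(Q[k][j] for k in A for j in A)
    return F


random.seed(3)
# Hypothetical symmetric non-negative affinity matrix with zero diagonal.
D = [[0.0] * p for _ in range(p)]
for k in range(p):
    for j in range(k + 1, p):
        D[k][j] = D[j][k] = random.random()

# Graph Laplacian Q = Diag(D 1) - D.
Q = [[(sum(D[k]) if k == j else 0.0) - D[k][j] for j in range(p)] for k in range(p)]


def cut(A):
    return sum(D[k][j] for k in A for j in range(p) if j not in A)


# Identity between the undirected cut and the quadratic form 1_A^T Q 1_A.
laplacian_gap = max(abs(cut(A) - sum(Q[k][j] for k in A for j in A)) for A in subsets)

q = [random.gauss(0.0, 1.0) for _ in range(p)]
submodular_with_laplacian = is_submodular(quadratic(q, Q))

# Make one off-diagonal entry positive: submodularity should be lost.
Q_bad = [row[:] for row in Q]
Q_bad[0][1] = Q_bad[1][0] = 0.7
submodular_with_positive_entry = is_submodular(quadratic(q, Q_bad))

print(laplacian_gap < 1e-12, submodular_with_laplacian, submodular_with_positive_entry)
```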

Regular functions and robust total variation. By partial
minimization, we obtain so-called regular functions [THIET]. One application
is "noisy cut functions": for a given weight function $d : W \times W \to
\mathbb{R}_+$, where each node in $W$ is uniquely associated with a node in $V$,
we consider the submodular function obtained as the minimum cut
adapted to $A$ in the augmented graph (see top-right plot of
Figure 3.2): $F(A) = \min_{B \subset W} \sum_{k \in B, \ j \in W \setminus B} d(k, j) + \lambda |A \Delta B|$, where
$A \Delta B = (A \setminus B) \cup (B \setminus A)$ is the symmetric difference between the sets $A$
and $B$. This allows for robust versions of cuts, where some gaps may
be tolerated; indeed, compared to having directly a small cut for $A$,
$B$ needs to have a small cut and be close to $A$, thus allowing some
elements to be removed or added to $A$ in order to lower the cut (see
more details in [7]).

The class of regular functions is particularly interesting, because
it leads to a family of submodular functions for which dedicated fast
algorithms exist. Indeed, minimizing the cut functions or the partially
minimized cut, plus a modular function defined by $z \in \mathbb{R}^p$, may be
done with a min-cut/max-flow algorithm (see, e.g., [29]). Indeed,
following [IHlEl], we add two nodes to the graph, a source $s$ and a sink $t$.
All original edges have non-negative capacities $d(k, j)$, while the edge
that links the source $s$ to the node $k \in V$ has capacity $(z_k)_+$ and the
edge that links the node $k \in V$ to the sink $t$ has weight $-(z_k)_-$ (see
bottom line of Figure 3.2). Finding a minimum cut or maximum flow
in this graph leads to a minimizer of $F - z$. For a detailed study of the
expressive power of functions expressible in terms of graph cuts, see,
e.g., [I1I1122].
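The source/sink construction can be sketched on a toy example. Everything specific below is an assumption of ours: the four-node chain graph, the vector `z`, the conventions $(z_k)_+ = \max(z_k, 0)$ for source edges and $\max(-z_k, 0)$ for sink edges, and the small Edmonds-Karp routine (a generic augmenting-path max-flow, not the dedicated algorithms cited in the text). The set of nodes reachable from the source in the final residual graph is compared with a brute-force minimizer of $F(A) - z(A)$.

```python
import itertools
from collections import deque


def min_cut_source_side(cap, s, t):
    """Edmonds-Karp max-flow; returns the nodes reachable from s in the final residual graph."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u in list(cap):
        for v in cap[u]:
            res.setdefault(v, {}).setdefault(u, 0.0)  # reverse residual arcs
    while True:
        parent, queue = {s: None}, deque([s])
        while queue:  # breadth-first search on arcs with positive residual capacity
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push


p = 4
# Hypothetical symmetric chain graph and modular term z.
d = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 0.5, (2, 1): 0.5, (2, 3): 2.0, (3, 2): 2.0}
z = [1.5, -0.2, 0.4, -1.0]

cap = {k: {} for k in range(p)}
cap["s"], cap["t"] = {}, {}
for (k, j), c in d.items():
    cap[k][j] = c
for k in range(p):
    if z[k] > 0:
        cap["s"][k] = z[k]    # source edge with capacity max(z_k, 0)
    elif z[k] < 0:
        cap[k]["t"] = -z[k]   # sink edge with capacity max(-z_k, 0)

A_cut = min_cut_source_side(cap, "s", "t") - {"s"}


def objective(A):
    # F(A) - z(A) for the cut function F.
    return sum(c for (k, j), c in d.items() if k in A and j not in A) - sum(z[k] for k in A)


best = min(objective(set(c)) for r in range(p + 1)
           for c in itertools.combinations(range(p), r))
print(sorted(A_cut), abs(objective(A_cut) - best) < 1e-9)
```

The value of the minimum cut equals $F(A) - z(A) + \sum_k (z_k)_+$ for the selected set $A$, which is why minimizing the cut minimizes $F - z$ up to a constant.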

For proximal methods, such as defined in Eq. (5.5) (Section 5),
we have $z = \psi(\alpha)$ and we need to solve an instance of a parametric
max-flow problem, which may be done using efficient dedicated
algorithms [SU E21 EI]. See also Section 7.3 for generic algorithms based on
a sequence of submodular function minimizations.

Applications to machine learning. Finding minimum cuts in
undirected graphs such as two-dimensional grids or extensions thereof
in more than two dimensions has become an important tool in computer
vision for image segmentation, where it is commonly referred to as graph
cut techniques (see, e.g., [SI] and references therein). In this context,
several extensions have been considered, such as multi-way cuts, where
exact optimization is not possible anymore, and a sequence of binary
graph cuts is used to find an approximate minimum (see also [108] for a
specific multi-way extension based on different submodular functions).

Fig. 3.2: Top: directed graph (left) and undirected graph corresponding to
regular functions (which can be obtained from cuts by partial
minimization; a set $A \subset V$ is displayed in red, with a set $B \subset W$ with
small cut but one more element than $A$, see text in Section 3.2 for
details). Bottom: graphs corresponding to the $s$-$t$ min-cut formulation
for minimizing the submodular function above plus a modular function
(see text for details).

The Lovász extension of cuts in an undirected graph, often referred
to as the total variation, has now become a classical regularizer in
signal processing and machine learning: given a graph, it will encourage
solutions to be piecewise-constant according to the graph (as opposed
to the graph Laplacian, which will impose smoothness along the edges
of the graph) [65\ IM]. See Section 4.2 for a formal description of the
sparsity-inducing properties of the Lovász extension; for chain graphs,
we obtain usual piecewise constant vectors, and they have many
applications in sequential problems (see, e.g., [ST j 1132^ [Ml [2T] and references
therein). Note that in this context, separable optimization problems
considered in Section 5 are heavily used and that algorithms presented
in Section 6 provide unified and efficient algorithms for all these
situations.

3.3 Set covers 

Given a non-negative set-function $D : 2^V \to \mathbb{R}_+$, we can define a
set-function $F$ through

\[ F(A) = \sum_{G \subset V, \ G \cap A \neq \varnothing} D(G), \]

with Lovász extension $f(w) = \sum_{G \subset V} D(G) \max_{k \in G} w_k$.

The submodularity and the Lovász extension can be obtained using
linearity and the fact that the Lovász extension of $A \mapsto 1_{G \cap A \neq \varnothing}$
is $w \mapsto \max_{k \in G} w_k$. In the context of structured sparsity-inducing
norms (see Section 2.3), these correspond to penalties of the form
$w \mapsto \Omega(w) = \sum_{G \subset V} D(G) \|w_G\|_\infty$, thus leading to overlapping group
Lasso formulations (see, e.g., [Ml IH E3 EH [H21 ES]). For example,
when $D(G) = 1$ for elements of a given partition, and zero otherwise,
then $F(A)$ counts the number of elements of the partition with
non-empty intersection with $A$. This leads to the classical non-overlapping
grouped $\ell_1/\ell_\infty$-norm.
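A minimal sketch of these formulas, assuming a hypothetical family of three overlapping groups with arbitrary weights: the greedy computation of the Lovász extension is compared with the closed form $\sum_G D(G) \max_{k \in G} w_k$, and the associated norm $\Omega(w) = f(|w|)$ is evaluated as a weighted sum of $\ell_\infty$-norms over groups.

```python
import random


def lovasz_extension(F, w):
    """Greedy formula: sort w decreasingly, weight each component by its marginal gain."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, prefix, previous = 0.0, set(), F(frozenset())
    for k in order:
        prefix.add(k)
        current = F(frozenset(prefix))
        value += w[k] * (current - previous)
        previous = current
    return value


# Hypothetical overlapping groups G with weights D(G) > 0 on V = {0, 1, 2, 3}.
groups = {frozenset({0, 1}): 1.0, frozenset({1, 2, 3}): 0.5, frozenset({3}): 2.0}


def F(A):
    # Set cover: total weight of the groups intersecting A.
    return sum(weight for G, weight in groups.items() if G & A)


def f_closed_form(w):
    # Lovász extension: sum over groups of D(G) * max_{k in G} w_k.
    return sum(weight * max(w[k] for k in G) for G, weight in groups.items())


def omega(w):
    # Overlapping-group norm: sum over groups of D(G) * ||w_G||_inf.
    return sum(weight * max(abs(w[k]) for k in G) for G, weight in groups.items())


random.seed(4)
w = [random.gauss(0.0, 1.0) for _ in range(4)]
gap_extension = abs(lovasz_extension(F, w) - f_closed_form(w))
gap_norm = abs(lovasz_extension(F, [abs(x) for x in w]) - omega(w))
print(gap_extension < 1e-12, gap_norm < 1e-12)
```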

Möbius inversion. Note that any set-function $F$ may be written as

\[ F(A) = \sum_{G \subset V, \ G \cap A \neq \varnothing} D(G) = \sum_{G \subset V} D(G) - \sum_{G \subset V \setminus A} D(G), \quad \text{i.e.,} \quad F(V) - F(V \setminus A) = \sum_{G \subset A} D(G), \]

for a certain set-function $D$, which is not usually non-negative. Indeed,
by the Möbius inversion formula¹ (see, e.g., [47]), we have:

\[ D(G) = \sum_{A \subset G} (-1)^{|G| - |A|} \big[ F(V) - F(V \setminus A) \big]. \]

¹ If $F$ and $G$ are any set functions such that $\forall A \subset V$, $F(A) = \sum_{B \subset A} G(B)$, then $\forall A \subset V$,
$G(A) = \sum_{B \subset A} (-1)^{|A \setminus B|} F(B)$.

Thus, functions for which $D$ is non-negative form a specific subset
of submodular functions (note that for all submodular functions, the
function $D(G)$ is non-negative for all pairs $G = \{i, j\}$, for $j \neq i$, as a
consequence of Prop. 1.2). Moreover, these functions are always
non-decreasing. For further links, see [19], where it is notably shown that
$D(G) = 0$ for all sets $G$ of cardinality greater than or equal to three for cut
functions (which are second-order polynomials in the indicator vector).
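The inversion formula can be illustrated on a small directed cut function. In this sketch (the random weights and helper names are assumptions of ours), the weights $D(G)$ are computed from $F$, the function is rebuilt from them, and the properties mentioned above are inspected: $D$ vanishes on sets of cardinality at least three, is non-negative on pairs, and takes negative values on singletons, so a cut is in general not a set cover with non-negative weights.

```python
import itertools
import random

p = 4
V = frozenset(range(p))


def subsets(S):
    items = sorted(S)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in itertools.combinations(items, r)]


def mobius_weights(F):
    """D(G) = sum over A subset of G of (-1)^(|G| - |A|) * [F(V) - F(V minus A)]."""
    return {G: sum((-1) ** (len(G) - len(A)) * (F(V) - F(V - A)) for A in subsets(G))
            for G in subsets(V)}


def cover_from_weights(D):
    """Rebuild A -> sum of D(G) over the sets G that intersect A."""
    def F(A):
        return sum(value for G, value in D.items() if G & A)
    return F


random.seed(5)
# Hypothetical directed weights for a cut function.
d = [[0.0 if k == j else random.random() for j in range(p)] for k in range(p)]


def cut(A):
    return sum(d[k][j] for k in A for j in range(p) if j not in A)


D = mobius_weights(cut)
rebuilt = cover_from_weights(D)

reconstruction_gap = max(abs(rebuilt(A) - cut(A)) for A in subsets(V))
largest_high_order = max(abs(value) for G, value in D.items() if len(G) >= 3)
smallest_pair = min(value for G, value in D.items() if len(G) == 2)
smallest_singleton = min(value for G, value in D.items() if len(G) == 1)
print(reconstruction_gap < 1e-9, largest_high_order < 1e-9,
      smallest_pair >= -1e-9, smallest_singleton < 0)
```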

Reinterpretation in terms of set-covers. Let $W$ be any "base"
set. Given for each $k \in V$, a set $S_k \subset W$, we define the cover
as $F(A) = |\bigcup_{k \in A} S_k|$. More generally, we can define $F(A) =
\sum_{j \in W} \Delta(j) 1_{\exists k \in A, \ S_k \ni j}$ if we have weights $\Delta(j) \in \mathbb{R}_+$ for $j \in W$ (this
corresponds to replacing the cardinality function on $W$ by a weighted
cardinality function, with weights defined by $\Delta$). Then, $F$ is submodular
(as a consequence of the equivalence with the previously defined
functions, which we now prove).

These two types of functions are in fact equivalent. Indeed, for a
weight function $D : 2^V \to \mathbb{R}_+$, we consider the base set $W$ to be
the power-set of $V$, i.e., $W = 2^V$, and $S_k = \{G \subset V, \ G \ni k\}$, and
$\Delta(G) = D(G)$, to obtain a set cover, since we then have

\[ F(A) = \sum_{G \subset V} D(G) 1_{A \cap G \neq \varnothing} = \sum_{G \subset V} D(G) 1_{\exists k \in A, \ k \in G} = \sum_{G \subset V} D(G) 1_{\exists k \in A, \ G \in S_k}. \]

Moreover, for a certain set cover defined by $W$, $S_k \subset W$, $k \in V$, and
$\Delta : W \to \mathbb{R}_+$, define $G_j = \{k \in V, \ S_k \ni j\}$, the subset of $V$ of points
that cover $j \in W$. We can then write the set cover as

\[ F(A) = \sum_{j \in W} \Delta(j) 1_{\exists k \in A, \ S_k \ni j} = \sum_{j \in W} \Delta(j) 1_{A \cap G_j \neq \varnothing}, \]

to obtain a set-function expressed in terms of groups and non-negative
weight functions.

Applications to machine learning. Submodular set-functions
which can be expressed as set covers (or equivalently as a sum of
maxima of certain components) have several applications, mostly as
regular set-covers or through their use in sparsity-inducing norms.

When used as set covers, submodular functions are traditionally
used because algorithms for maximization with theoretical guarantees
may be used (see Section 8). See [88] for several applications.

When used through their Lovász extensions, we obtain structured
sparsity-inducing norms which can be used to impose specific prior
knowledge into learning problems: indeed, as shown in Section 2.3, they
correspond to a convex relaxation of the set-function applied to the
support of the predictor. Moreover, as shown in [74^16] and in Section 4.3,
they lead to specific sparsity patterns (i.e., supports), which are stable
for the submodular function, i.e., such that they cannot be increased
without increasing the set-function. For this particular example, stable
sets are exactly intersections of complements of groups $G$ such that
$D(G) > 0$ (see more details in ^j), that is, some of the groups with
non-zero weights carve out the set $V$ to obtain the support of the
predictor. Note that following [O^, all of these may be interpreted in
terms of flows (see Section 3.4) in order to obtain fast algorithms to
solve the proximal problems.

Fig. 3.3: Flow (top) and set of groups (bottom) for sequences. When
these groups have unit weights (i.e., $D(G) = 1$ for these groups and
zero for all others), then the submodular function $F(A)$ is equal to the
number of sequential pairs with at least one present element. When
applied to sparsity-inducing norms, this leads to supports that have no
isolated points (see applications in [95]).

By choosing a certain set of groups $G$ such that $D(G) > 0$, we can
model several interesting behaviors (see more details in ^):

• Line segments: Given $p$ variables organized in a sequence,
using the set of groups of Figure 3.4, it is only possible to
select contiguous nonzero patterns. In this case, we have $p$
groups with non-zero weight, and the submodular function
is equal, up to constants, to the length of the range of $A$
(i.e., the distance between the rightmost element of $A$ and
the leftmost element of $A$).

• Two-dimensional convex supports: Similarly, assume
now that the $p$ variables are organized on a two-dimensional
grid. To constrain the allowed supports to be the set of all
rectangles on this grid, a possible set of groups to consider
may be composed of half planes with specific orientations:
if only vertical and horizontal orientations are used, the set
of allowed patterns is the set of rectangles, while with more
general orientations, more general convex patterns may be
obtained. These can be applied to images, and in particular
in structured sparse component analysis where the dictionary
elements can be assumed to be localized in space [78].

• Two-dimensional block structures on a grid: Using
sparsity-inducing regularizations built upon groups which are
composed of variables together with their spatial neighbors
(see Figure 3.4) leads to good performances for background
subtraction [201 UHl ESI |95], topographic dictionary
learning [79, 96], and wavelet-based denoising [119].

• Hierarchical structures: here we assume that the variables
are organized in a hierarchy. Precisely, we assume that the
$p$ variables can be assigned to the nodes of a tree (or a forest
of trees), and that a given variable may be selected only
if all its ancestors in the tree have already been selected.
This corresponds to a set-function which counts the number
of ancestors of a given set $A$ (note that, as shown in
Section 4.3, the stable sets of this set-function are exactly the
ones described above).

This hierarchical rule is exactly respected when using the
family of groups displayed on Figure 3.5. The corresponding
penalty was first used in [140]; one of its simplest instances in
the context of regression is the sparse group Lasso |129t fi8]; it
has found numerous applications, for instance, wavelet-based
denoising [140] \T0\ [68] [77], hierarchical dictionary learning
for both topic modelling and image restoration [76, 77],
log-linear models for the selection of potential orders [122],
bioinformatics, to exploit the tree structure of gene networks for
multi-task regression [82], and multi-scale mining of fMRI
data for the prediction of simple cognitive tasks [7S]. See also
Section 9.3 for an application to non-parametric estimation
with a wavelet basis.

• Extensions: Possible choices for the sets of groups (and
thus the set functions) are not limited to the aforementioned
examples; more complicated topologies can be considered,
for example three-dimensional spaces discretized in cubes or
spherical volumes discretized in slices (see an application to
neuroimaging by [134]), and more complicated hierarchical
structures based on directed acyclic graphs can be encoded
as further developed in [5] to perform non-linear variable
selection.

Fig. 3.4: Flow (top) and set of groups (bottom) for sequences. When
these groups have unit weights (i.e., $D(G) = 1$ for these groups and
zero for all others), then the submodular function $F(A)$ is equal (up to
constants) to the length of the range of $A$ (i.e., the distance between the
rightmost element of $A$ and the leftmost element of $A$). When applied
to sparsity-inducing norms, this leads to supports which are contiguous
segments (see applications in [78]).

Covers vs. covers. Set covers also classically occur in the context 
of submodular function maximization, where the goal is, given certain 
subsets of V, to find the least number of these that completely cover V. 
Note that the main difference is that in the context of set covers con- 
sidered here, the cover is considered on a potentially different set W 
than V, and each element of V indexes a subset of W. 

3.4 Flows 

Following [98], we can obtain a family of non-decreasing submodular
set-functions (which include set covers) from multi-sink multi-source
networks. We define a weight function on a set $W$, which includes a
set $S$ of sources and a set $V$ of sinks (which will be the set on which
the submodular function will be defined). We assume that we are given
capacities, i.e., a function $c$ from $W \times W$ to $\mathbb{R}_+$. For all functions
$\varphi : W \times W \to \mathbb{R}$, we use the notation $\varphi(A, B) = \sum_{k \in A, \ j \in B} \varphi(k, j)$.
A flow is a function $\varphi : W \times W \to \mathbb{R}_+$ such that (a) $\varphi \leqslant c$ for all
arcs, (b) for all $w \in W \setminus (S \cup V)$, the net-flow at $w$, i.e., $\varphi(W, \{w\}) -
\varphi(\{w\}, W)$, is null, (c) for all sources $s \in S$, the net-flow at $s$ is
non-positive, i.e., $\varphi(W, \{s\}) - \varphi(\{s\}, W) \leqslant 0$, (d) for all sinks $t \in V$, the
net-flow at $t$ is non-negative, i.e., $\varphi(W, \{t\}) - \varphi(\{t\}, W) \geqslant 0$. We denote
by $\mathcal{F}$ the set of flows.

Fig. 3.5: Left: Groups corresponding to a hierarchy. Right: network flow
interpretation of same submodular function (see Section 3.4). When
these groups have unit weights (i.e., $D(G) = 1$ for these groups and
zero for all others), then the submodular function $F(A)$ is equal to the
cardinality of the union of all ancestors of $A$. When applied to
sparsity-inducing norms, this leads to supports that select a variable only after
all of its ancestors have been selected (see applications in [76]).

For $A \subset V$ (the set of sinks), we define

\[ F(A) = \max_{\varphi \in \mathcal{F}} \ \varphi(W, A) - \varphi(A, W), \]

which is the maximal net-flow getting out of $A$. From the
max-flow/min-cut theorem (see, e.g., [29]), we have immediately that

\[ F(A) = \min_{X \subset W, \ S \subset X, \ A \subset W \setminus X} c(X, W \setminus X). \]

One then obtains that $F$ is submodular (as the partial minimization
of a cut function, see Prop. B.4) and non-decreasing by construction.
One particularity is that for this type of submodular non-decreasing
functions, we have an explicit description of the intersection of the
positive orthant and the submodular polyhedron (potentially simpler
than through the supporting hyperplanes $\{s(A) = F(A)\}$). Indeed,
$s \in \mathbb{R}_+^p$ belongs to $P(F)$ if and only if there exists a flow $\varphi \in \mathcal{F}$ such
that for all $k \in V$, $s_k = \varphi(W, \{k\}) - \varphi(\{k\}, W)$ is the net-flow getting
out of $k$.
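The min-cut expression of $F$ can be evaluated by brute force on a tiny network, which gives a direct way to observe submodularity and monotonicity. The network below (one source `s`, two intermediate nodes `a` and `b`, three sinks, and the listed capacities) is a hypothetical example of ours, and the enumeration over all admissible sets $X$ is only practical for very small $W$.

```python
import itertools

# Hypothetical network: W = {s, a, b, 0, 1, 2}, source set S = {s}, sinks V = {0, 1, 2}.
nodes = ["s", "a", "b", 0, 1, 2]
sinks = [0, 1, 2]
cap = {("s", "a"): 2.0, ("s", "b"): 1.0,
       ("a", 0): 1.0, ("a", 1): 1.0, ("b", 1): 1.0, ("b", 2): 1.0}


def F(A):
    """Minimum of c(X, W minus X) over sets X containing the source and disjoint from A."""
    free = [u for u in nodes if u != "s" and u not in A]
    best = float("inf")
    for r in range(len(free) + 1):
        for extra in itertools.combinations(free, r):
            X = {"s", *extra}
            value = sum(c for (u, v), c in cap.items() if u in X and v not in X)
            best = min(best, value)
    return best


subsets = [frozenset(c) for r in range(len(sinks) + 1)
           for c in itertools.combinations(sinks, r)]

submodular = all(F(A) + F(B) >= F(A | B) + F(A & B) - 1e-12
                 for A in subsets for B in subsets)
non_decreasing = all(F(A) <= F(B) + 1e-12
                     for A in subsets for B in subsets if A <= B)
print(submodular, non_decreasing, [F(frozenset({k})) for k in sinks])
```

In this example the middle sink can receive flow through both intermediate nodes, so its singleton value is larger than that of the two outer sinks.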

Similarly to other cut-derived functions, there are dedicated
algorithms for proximal methods and submodular minimization [63]. See
also Section 6.1 for a general divide-and-conquer strategy for solving
separable optimization problems based on a sequence of submodular
function minimization problems (here, min cut/max flow problems).

Flow interpretation of set-covers. Following |95] . we now show 
that the submodular functions defined in this section includes the 
ones defined in Section 13.31 Indeed, consider a non-negative function 
D : 2^ ^ M+, and define F{A) = X^ccF Gr^A^0 ■^i'^)- '^^^^ Lovasz 
extension may be written as, for all w G (introducing variables 
in a scaled simplex reduced to variables indexed by G): 



u 



fiw) = ^ D{G)maxwk 
GcV 

Emax w'^t'' 
GcV 



max ^ "■•'^^^ 

tGm\, t^\G=o, tG(G)=D(G), Gcy 

= max 

tOm^+, t^\G=0' i''iO)=DiG), GcV^^y ,^^y^ 

Because of the representation of f as a maximum of linear functions shown in Prop. 2.2, $s \in P(F) \cap \mathbb{R}_+^p$ if and only if there exist $t^G \in \mathbb{R}_+^p$ with $t^G_{V \setminus G} = 0$ and $t^G(G) = D(G)$ for all $G \subseteq V$, such that for all $k \in V$, $s_k = \sum_{G \subseteq V,\ G \ni k} t^G_k$. This can be given a network flow interpretation on the graph composed of a single source, one node per subset $G \subseteq V$ such that $D(G) > 0$, and the sink set V. The source is connected to all subsets G, with capacity $D(G)$, and each subset is connected to the variables it contains, with infinite capacity. In this representation, $t^G_k$ is the flow from the node corresponding to G to the node corresponding to the sink node k, and $s_k = \sum_{G \subseteq V} t^G_k$ is the net-flow out of the sink k.

Thus, $s \in P(F) \cap \mathbb{R}_+^p$ if and only if there exists a flow in this graph so that the net-flow getting out of k is $s_k$, which corresponds exactly to a network flow submodular function.
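As a small numerical illustration of the set-cover construction above, the following sketch compares the closed form $\sum_G D(G) \max_{k \in G} w_k$ with the Lovász extension computed by the greedy algorithm. The function names and the toy groups are hypothetical, chosen only for this example; they are not part of the text.

```python
def set_cover_F(D, A):
    """F(A) = sum of D(G) over all groups G that intersect A."""
    A = frozenset(A)
    return sum(d for G, d in D.items() if G & A)


def lovasz_greedy(F, w):
    """Lovasz extension f(w) via the greedy algorithm (assumes F(empty set) = 0)."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, prefix, prev = 0.0, [], 0.0
    for k in order:
        prefix.append(k)
        cur = F(prefix)
        value += w[k] * (cur - prev)  # w_k times the marginal gain of k
        prev = cur
    return value


def lovasz_closed_form(D, w):
    """Closed form for set-cover functions: sum over groups of D(G) * max_{k in G} w_k."""
    return sum(d * max(w[k] for k in G) for G, d in D.items())


# Hypothetical toy instance: three groups on V = {0, 1, 2}.
D = {frozenset({0, 1}): 1.0, frozenset({1, 2}): 2.0, frozenset({2}): 0.5}
w = [0.3, 0.7, 0.1]
print(lovasz_greedy(lambda A: set_cover_F(D, A), w), lovasz_closed_form(D, w))
```

Both numbers should agree up to rounding; the check is brute force and only meant for small ground sets.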

We give examples of such networks in Figure 3.3 and Figure 3.4. This reinterpretation allows the use of fast algorithms for proximal problems (as there exist fast algorithms for maximum flow problems). The number of nodes in the network flow is the number of groups G such that $D(G) > 0$, but this number may be reduced in some situations.
See [95 [ 1^ for more details on such graph constructions (in particular 
in how to reduce the number of edges in many situations). 

Application to machine learning. Applications to sparsity-inducing norms (as described in Section 2.3) lead to applications to hierarchical dictionary learning and topic models [76], structured priors for image denoising [76, 77], background subtraction [95], and bioinformatics [71, 82]. Moreover, many submodular functions may be interpreted
in terms of flows, allowing the use of fast algorithms (see, e.g., [631 E] 
for more details). 

3.5 Entropies 

Given p random variables $X_1, \dots, X_p$ which all take a finite number of values, we define $F(A)$ as the joint entropy of the variables $(X_k)_{k \in A}$ (see, e.g., [33]). This function is submodular because, if $A \subseteq B$ and $k \notin B$,

$$F(A \cup \{k\}) - F(A) = H(X_A, X_k) - H(X_A) = H(X_k | X_A) \geq H(X_k | X_B) = F(B \cup \{k\}) - F(B)$$

(by the data processing inequality [32]). Moreover, its symmetrization (see the footnote below) leads to the mutual information between variables indexed by A and variables indexed by $V \setminus A$.
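The diminishing-returns inequality above can be checked by brute force on a small example. The sketch below is only an illustration under stated assumptions: the joint distribution of three binary variables is an arbitrary hypothetical choice, and the helper names are not from the text.

```python
import itertools
import math


def joint_entropy(pmf, A):
    """Entropy (in nats) of the variables indexed by A; pmf maps outcome tuples to probabilities."""
    marginal = {}
    for x, px in pmf.items():
        key = tuple(x[k] for k in sorted(A))
        marginal[key] = marginal.get(key, 0.0) + px
    return -sum(q * math.log(q) for q in marginal.values() if q > 0)


def diminishing_returns_holds(F, p, tol=1e-9):
    """Check F(A + k) - F(A) >= F(B + k) - F(B) for all A subset of B and k outside B."""
    subsets = [frozenset(c) for r in range(p + 1)
               for c in itertools.combinations(range(p), r)]
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for k in range(p):
                if k in B:
                    continue
                if F(A | {k}) - F(A) < F(B | {k}) - F(B) - tol:
                    return False
    return True


# Hypothetical joint distribution of three binary variables (probabilities sum to one).
outcomes = list(itertools.product([0, 1], repeat=3))
probs = [0.10, 0.05, 0.20, 0.05, 0.15, 0.10, 0.05, 0.30]
pmf = dict(zip(outcomes, probs))
F = lambda A: joint_entropy(pmf, A)
print(diminishing_returns_holds(F, 3))
```

The same helper also gives the symmetrization $F(A) + F(V \setminus A) - F(V)$, i.e., the mutual information between the two groups of variables.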

This can be extended to any distribution by considering differential entropies. One application is for Gaussian random variables, leading to the submodularity of the function defined through $F(A) = \log\det Q_{AA}$, for some positive definite matrix $Q \in \mathbb{R}^{p \times p}$ (see further related examples in Section 3.6).
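For small p, the submodularity of $A \mapsto \log\det Q_{AA}$ can be verified numerically through second-order differences. The following is a minimal sketch, assuming a hypothetical 3 x 3 positive definite matrix and a plain Cholesky factorization written out by hand so that no external library is needed.

```python
import itertools
import math


def logdet(Q, A):
    """log det of the principal submatrix Q[A, A] via Cholesky (Q assumed positive definite)."""
    idx = sorted(A)
    n = len(idx)
    L = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = Q[idx[i]][idx[j]] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s)
                total += 2.0 * math.log(L[i][j])
            else:
                L[i][j] = s / L[j][j]
    return total  # equals 0.0 for the empty set


def is_submodular(F, p, tol=1e-9):
    """Check F(A+j) + F(A+k) >= F(A+j+k) + F(A) for all A and distinct j, k outside A."""
    for r in range(p + 1):
        for A in itertools.combinations(range(p), r):
            A = set(A)
            rest = [k for k in range(p) if k not in A]
            for j, k in itertools.combinations(rest, 2):
                if F(A | {j}) + F(A | {k}) < F(A | {j, k}) + F(A) - tol:
                    return False
    return True


# Hypothetical positive definite matrix (leading minors 2, 3 and 4.5).
Q = [[2.0, 1.0, 0.5],
     [1.0, 2.0, 1.0],
     [0.5, 1.0, 2.0]]
print(is_submodular(lambda A: logdet(Q, A), 3))
```

The enumeration is exponential in p and is only intended as a sanity check.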



Footnote (symmetrization). For any submodular function F, one may define its symmetrized version as $G(A) = F(A) + F(V \setminus A) - F(V)$, which is submodular and symmetric. See further details in Section EH and Appendix B.2.

Entropies are less general than submodular functions. Entropies of discrete variables are non-decreasing, non-negative submodular set-functions. However, they are more restricted than this, i.e., they satisfy other properties which are not satisfied by all submodular functions [139]. Note also that it is not known if their special structure can be fruitfully exploited to speed up some of the algorithms presented in Section 7.

Applications to probabilistic modelling. In the context of prob- 
abilistic graphical models, entropies occur in particular in algorithms 
for structure learning: indeed, for directed graphical models, given 
the directed acyclic graph, the minimum Kullback-Leibler divergence 
between a given distribution and a distribution that factorizes into 
the graphical model may be expressed in closed form through en- 
tropies [5^ ■ Applications of submodular function optimization may 
be found in this context, with both maximization [105] for learning bounded-treewidth graphical models and minimization for learning naive Bayes models [86], or both (i.e., minimizing differences of submodular functions, as shown in Section [5]) for discriminative learning of structure [106].

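Entropies also occur in experimental design in Gaussian linear models [125]. Given a design matrix $X \in \mathbb{R}^{n \times p}$, assume that the vector $y \in \mathbb{R}^n$ is distributed as $Xw + \sigma\varepsilon$, where w has a normal prior distribution with mean zero and covariance matrix $\sigma^2\lambda^{-1} I$, and $\varepsilon \in \mathbb{R}^n$ is a standard normal vector. The posterior distribution of w given y is normal with mean $\lambda^{-1}\sigma^2 X^\top (\sigma^2\lambda^{-1} X X^\top + \sigma^2 I)^{-1} y$ and covariance matrix

$$
\begin{aligned}
\lambda^{-1}\sigma^2 I - (\lambda^{-1}\sigma^2)^2 X^\top (\sigma^2\lambda^{-1} X X^\top + \sigma^2 I)^{-1} X
&= \lambda^{-1}\sigma^2 \big[ I - X^\top (X X^\top + \lambda I)^{-1} X \big] \\
&= \lambda^{-1}\sigma^2 \big[ I - (X^\top X + \lambda I)^{-1} X^\top X \big] \\
&= \sigma^2 (X^\top X + \lambda I)^{-1}.
\end{aligned}
$$

The posterior entropy of w given y is thus equal (up to constants) to $p \log \sigma^2 - \log\det(X^\top X + \lambda I)$. If only the observations in A are observed, then the posterior entropy of w given $y_A$ is equal to $p \log \sigma^2 - \log\det(X_A^\top X_A + \lambda I)$, where $X_A$ denotes the rows of X indexed by A. This function is supermodular: since $\det(X_A^\top X_A + \lambda I) = \lambda^{p - |A|} \det(X_A X_A^\top + \lambda I)$, it is equal, up to a modular term, to $-\log\det(X_A X_A^\top + \lambda I)$, and $A \mapsto \log\det(X_A X_A^\top + \lambda I)$ is submodular because the entropy of a Gaussian random variable is (up to constants) the logarithm of the determinant of its covariance matrix. In experimental design, the goal is to select the set A of observations so that the posterior entropy of w given $y_A$ is minimal (see, e.g., [S3]), and is thus equivalent to maximizing a submodular function (for which forward selection has theoretical guarantees, see Section 8.2). Note the difference with subset selection (Section 3.7) where the goal is to select columns of the design matrix instead of rows.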

Application to semi-supervised clustering. Given p data points $x_1, \dots, x_p$ in a certain set $\mathcal{X}$, we assume that we are given a Gaussian process $(f_x)_{x \in \mathcal{X}}$. For any subset $A \subseteq V$, $f_{x_A}$ is normally distributed with mean zero and covariance matrix $K_{AA}$, where K is the $p \times p$ kernel matrix of the p data points, i.e., $K_{ij} = k(x_i, x_j)$ where k is the kernel function associated with the Gaussian process (see, e.g., [120]). We assume a modular prior distribution on subsets of the form $p(A) \propto \prod_{k \in A} \eta_k \prod_{k \notin A} (1 - \eta_k)$ (i.e., each element k has a certain prior probability $\eta_k$ of being present, with all decisions being statistically independent).

Once a set A is selected, we only assume that we want to model the two parts, A and $V \setminus A$, as two independent Gaussian processes with covariance matrices $\Sigma_A$ and $\Sigma_{V \setminus A}$. In order to maximize the likelihood under the joint Gaussian process, the best estimates are $\Sigma_A = K_{AA}$ and $\Sigma_{V \setminus A} = K_{V \setminus A, V \setminus A}$. This leads to the following negative log-likelihood

$$I(f_A, f_{V \setminus A}) - \sum_{k \in A} \log \eta_k - \sum_{k \in V \setminus A} \log(1 - \eta_k),$$

where $I(f_A, f_{V \setminus A})$ is the mutual information between two Gaussian processes (see similar reasoning in the context of independent component
analysis [H]). 

We thus need to minimize a modular function plus a mutual in- 
formation between the variables indexed by A and the ones indexed 
by V\A, which is submodular and symmetric. Thus in this Gaussian 
process interpretation, clustering may be cast as submodular function 
minimization. This probabilistic interpretation extends the minimum 
description length interpretation of [108] to semi-supervised clustering. 

Note here that, similarly to the unsupervised clustering framework of [108], the mutual information may be replaced by any symmetric submodular function, such as a cut function obtained from appropriately defined weights. In Figure 3.6, we consider $\mathcal{X} = \mathbb{R}^2$ and sample points from a traditional distribution in semi-supervised clustering, i.e., the "two moons" dataset. We consider 100 points and 8 randomly chosen labelled points, for which we impose $\eta_k \in \{0, 1\}$, the rest of the $\eta_k$ being equal to 1/2 (i.e., we impose a hard constraint on the labelled points to be in the correct clusters). We consider a Gaussian kernel $k(x, y) = \exp(-\alpha \|x - y\|_2^2)$, and we compare two symmetric submodular functions: mutual information and the weighted cuts obtained from the same matrix K (note that the two functions use different assumptions regarding the kernel matrix: positive definiteness for the mutual information, and pointwise positivity for the cut). As shown in Figure 3.6, by using more than second-order interactions, the mutual information is better able to capture the structure of the two clusters. This example is used as an illustration and more experiments and analysis would be needed to obtain sharper statements. In Section [U we use this example for comparing different submodular function minimization procedures. Note that even in the case of symmetric submodular functions F, where more efficient algorithms in $O(p^3)$ for submodular function minimization (SFM) exist [117] (see also Section [77i|), the minimization of functions of the form $F(A) - z(A)$, for $z \in \mathbb{R}^p$, is provably as hard as general SFM [117].

Fig. 3.6: Examples of semi-supervised clustering: (left) observations, (middle) results of the semi-supervised clustering algorithm based on submodular function minimization, with eight labelled data points, with the mutual information, (right) same procedure with the cut function.

3.6 Spectral functions of submatrices

Given a positive semidefinite matrix $Q \in \mathbb{R}^{p \times p}$ and a real-valued function h from $\mathbb{R}_+$ to $\mathbb{R}$, one may define the matrix function [54] $Q \mapsto h(Q)$ defined on positive semidefinite matrices by leaving unchanged the eigenvectors of Q and applying h to each of the eigenvalues. This leads to the expression of $\mathrm{tr}[h(Q)]$ as $\sum_{i=1}^p h(\lambda_i)$ where $\lambda_1, \dots, \lambda_p$ are the (nonnegative) eigenvalues of Q [66]. We can thus define the function $F(A) = \mathrm{tr}\, h(Q_{AA})$ for $A \subseteq V$. Note that for Q diagonal, we exactly recover functions of modular functions considered in Section 3.1.

The concavity of h is not sufficient however in general to ensure the submodularity of F, as can be seen by generating random examples with $h(\lambda) = \lambda / (\lambda + 1)$.

Nevertheless, we know that the functions $h(\lambda) = \log(\lambda + t)$ for $t \geq 0$ lead to submodular functions since they lead to the entropy of a Gaussian random variable with joint covariance matrix $Q + tI$. Thus, since for $\rho \in (0, 1)$,

$$\lambda^\rho = \frac{\rho \sin \rho\pi}{\pi} \int_0^\infty \log(1 + \lambda / t)\, t^{\rho - 1}\, dt$$

(see, e.g., [3]), $h(\lambda) = \lambda^\rho$ for $\rho \in (0, 1]$ is a positive linear combination of functions that lead to non-decreasing submodular set-functions. We thus obtain a non-decreasing submodular function.

This can be generalized to functions of the singular values of $X(A, B)$ where X is a rectangular matrix, by considering the fact that singular values of a matrix X are related to the eigenvalues of $\begin{pmatrix} 0 & X \\ X^\top & 0 \end{pmatrix}$ (see, e.g., [54]).

Application to machine learning (Bayesian variable selection). As shown in [6], such functions naturally appear in the context of variable selection using the Bayesian marginal likelihood (see, e.g., [52]). Indeed, given a subset A, assume that the vector $y \in \mathbb{R}^n$ is distributed as $X_A w_A + \sigma\varepsilon$, where X is a design matrix in $\mathbb{R}^{n \times p}$ and $w_A$ a vector with support in A, and $\varepsilon \in \mathbb{R}^n$ is a standard normal vector; if a normal prior with covariance matrix $\sigma^2\lambda^{-1} I$ is imposed on $w_A$, then the negative log-marginal likelihood of y given A (i.e., obtained by marginalizing $w_A$) is equal to (up to constants) [126]:

$$\min_{w_A}\ \frac{1}{2\sigma^2} \|y - X_A w_A\|_2^2 + \frac{\lambda}{2\sigma^2} \|w_A\|_2^2 + \frac{1}{2} \log\det\big[\sigma^2\lambda^{-1} X_A X_A^\top + \sigma^2 I\big].$$

Thus, in a Bayesian model selection setting, in order to find the best subset A, it is necessary to minimize with respect to w:

$$\frac{1}{2\sigma^2} \|y - Xw\|_2^2 + \frac{\lambda}{2\sigma^2} \|w\|_2^2 + \frac{1}{2} \log\det\big[\sigma^2\lambda^{-1} X_{\mathrm{Supp}(w)} X_{\mathrm{Supp}(w)}^\top + \sigma^2 I\big],$$

which, in the framework outlined in Section 2.3, leads to the submodular function $F(A) = \frac{1}{2} \log\det[\lambda^{-1}\sigma^2 X_A X_A^\top + \sigma^2 I] = \frac{1}{2} \log\det[X_A X_A^\top + \lambda I] + \frac{n}{2} \log(\lambda^{-1}\sigma^2)$. Note also that, since we use a penalty which is the sum of a squared $\ell_2$-norm and a submodular function applied to the support, a direct convex relaxation may be obtained through reweighted least-squares formulations using the $\ell_2$-relaxation of combinatorial penalties presented in Section 2.3 (see also [115]). See also related simulation experiments for random designs from the Gaussian ensemble in [6].

Note that a traditional frequentist criterion is to penalize larger subsets A by Mallows' $C_L$ criterion [97], which is equal to $A \mapsto \mathrm{tr}\, (X_A X_A^\top + \lambda I)^{-1} X_A X_A^\top$, which is not a submodular function.

3.7 Best subset selection 

Following |3S], we consider p random variables (covariates) $X_1, \dots, X_p$, and a random response Y with unit variance, i.e., $\mathrm{var}(Y) = 1$. We consider predicting Y linearly from X. We consider $F(A) = \mathrm{var}(Y) - \mathrm{var}(Y | X_A)$. The function F is non-decreasing (the conditional variance of Y decreases as we observe more variables). In order to show the submodularity of F using Prop. 1.2, we compute, for all $A \subseteq V$, and j, k distinct elements in $V \setminus A$, the following quantity:

$$
\begin{aligned}
& F(A \cup \{j, k\}) - F(A \cup \{j\}) - F(A \cup \{k\}) + F(A) \\
&= [\mathrm{var}(Y | X_A, X_k) - \mathrm{var}(Y | X_A)] - [\mathrm{var}(Y | X_A, X_j, X_k) - \mathrm{var}(Y | X_A, X_j)] \\
&= -\mathrm{Corr}(Y, X_k | X_A)^2 + \mathrm{Corr}(Y, X_k | X_A, X_j)^2,
\end{aligned}
$$

using standard arguments for conditioning variances (see more details in [36]). Thus, the function is submodular if and only if the last quantity is always non-positive, i.e., $|\mathrm{Corr}(Y, X_k | X_A, X_j)| \leq |\mathrm{Corr}(Y, X_k | X_A)|$, which is often referred to as the fact that the variable $X_j$ is not a suppressor for the variable $X_k$ given A.

Thus greedy algorithms for maximization have theoretical guaran- 
tees (see Section [8]) if the assumption is met. Note however that the 
condition on suppressors is rather strong, although it can be appropri- 
ately relaxed in order to obtain more widely applicable guarantees for 
subset selection [37] . 

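Subset selection as the difference of two submodular functions. If we consider the linear model from the end of Section 3.6, then given a subset A, maximizing the log-likelihood with respect to $w_A$ and $\sigma^2$, we obtain a negative log-likelihood of the form:

$$
\begin{aligned}
& \min_{\sigma^2,\ w_A}\ \frac{n}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \|y - X_A w_A\|_2^2 + \frac{\lambda}{2\sigma^2} \|w_A\|_2^2 \\
&= \min_{\sigma^2}\ \frac{n}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \|y\|_2^2 - \frac{1}{2\sigma^2} \mathrm{tr}\, y^\top X_A (X_A^\top X_A + \lambda I)^{-1} X_A^\top y \\
&= \frac{n}{2} \log \Big[ \frac{1}{n}\, y^\top \big( I - X_A (X_A^\top X_A + \lambda I)^{-1} X_A^\top \big) y \Big] + \frac{n}{2},
\end{aligned}
$$

which is a difference of two submodular functions (see Section 8.3 for related optimization schemes). Indeed, by the Schur complement formula, the quadratic form inside the logarithm is the ratio of the determinant of the matrix with blocks $X_A^\top X_A + \lambda I$, $X_A^\top y$, $y^\top X_A$ and $y^\top y$ to the determinant of $X_A^\top X_A + \lambda I$, and both log-determinants are submodular functions of A. This function is non-increasing, so in order to perform variable selection, it is necessary to add another criterion, which can be the cardinality of A; or in a Bayesian setting, we can replace the above maximization with respect to $w_A$ by a marginalization, which leads to an extra term of the form $\frac{1}{2} \log\det(X_A^\top X_A + \lambda I)$, which does not change the type of minimization problems.

Note the difference between this formulation (aiming at minimizing a set-function directly by marginalizing out or maximizing out w) and the one from Section 3.6, which provides a convex relaxation of the maximum likelihood problem by maximizing the likelihood with respect to w.

3.8 Matroids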

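Given a set V, we consider a family $\mathcal{I}$ of subsets of V such that (a) $\varnothing \in \mathcal{I}$, (b) $I_1 \subseteq I_2 \in \mathcal{I} \Rightarrow I_1 \in \mathcal{I}$, and (c) for all $I_1, I_2 \in \mathcal{I}$, $|I_1| < |I_2| \Rightarrow \exists k \in I_2 \setminus I_1,\ I_1 \cup \{k\} \in \mathcal{I}$. The pair $(V, \mathcal{I})$ is then referred to as a matroid, with $\mathcal{I}$ its family of independent sets. Then, the rank function of the matroid, defined as $F(A) = \max_{I \subseteq A,\ I \in \mathcal{I}} |I|$, is submodular (see the footnote on the submodularity of the rank function).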

A classical example is the graphic matroid; it corresponds to V being the edge set of a certain graph, and $\mathcal{I}$ being the set of subsets of edges which do not contain any cycle. The rank function $\rho(A)$ is then equal to the number of vertices of the graph minus the number of connected components of the subgraph induced by A (i.e., the subgraph with all vertices and only the edges in A).
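The rank function of the graphic matroid can be sketched directly from this description. The code below is a hedged illustration: it assumes that the rank counts the vertices of the graph minus the connected components of the subgraph that keeps all vertices and the edges in A, and the small graph is a hypothetical example.

```python
def num_components(n_vertices, edge_list):
    """Number of connected components of the graph (range(n_vertices), edge_list), by DFS."""
    adj = {v: [] for v in range(n_vertices)}
    for u, v in edge_list:
        adj[u].append(v)
        adj[v].append(u)
    seen, count = set(), 0
    for start in range(n_vertices):
        if start in seen:
            continue
        count += 1
        stack = [start]
        seen.add(start)
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return count


def graphic_rank(n_vertices, edges, A):
    """Rank of the edge subset A: vertices minus connected components of (vertices, A)."""
    return n_vertices - num_components(n_vertices, [edges[e] for e in A])


# Hypothetical graph: a triangle on {0, 1, 2} plus a pendant edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(graphic_rank(4, edges, [0, 1, 2]), graphic_rank(4, edges, [0, 1, 2, 3]))
```

On this example the three triangle edges have rank 2 (they contain a cycle), and the full edge set has rank 3, the size of a spanning tree.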

The other classical example is the linear matroid. Given a matrix M with p columns, a set I is independent if and only if the columns indexed by I are linearly independent. The rank function $\rho(A)$ is then the rank of the columns indexed by A (this is also an instance of functions from Section 3.6 because the rank is the number of non-zero eigenvalues, and when $\rho \to 0^+$, then $\lambda^\rho \to 1_{\lambda > 0}$). For more details on matroids, see, e.g., [124].

Greedy algorithm. For matroid rank functions, extreme points of the base polyhedron have components equal to zero or one (because $F(A \cup \{k\}) - F(A) \in \{0, 1\}$ for any $A \subseteq V$ and $k \in V$), and are incidence vectors of the maximal independent sets (maximal because of the constraint $s(V) = F(V)$). Thus, the greedy algorithm for maximizing linear functions on the base polyhedron may be used to find maximum weight maximal independent sets, where a certain weight is given to all elements of V. In this situation, the greedy algorithm is actually greedy, i.e., it first orders the elements of V by decreasing weight and then selects elements of V following this order, skipping the elements which lead to non-independent sets.

For the graphic matroid, the base polyhedron is thus the convex hull of the incidence vectors of sets of edges which form a spanning tree, and is often referred to as the spanning tree polytope [25]. The greedy algorithm is then exactly Kruskal's algorithm to find maximum weight spanning trees [29].

Footnote (submodularity of the rank function). This can be shown directly using Prop. 1.1. We first show that for any $A \subseteq V$ and $k \notin A$, $F(A \cup \{k\}) - F(A) \in \{0, 1\}$ as a consequence of property (c). Then, we only need to show that if $F(A \cup \{k\}) = F(A)$, then for all B containing A (and not containing k), $F(B \cup \{k\}) = F(B)$, which is a consequence of property (b).
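The greedy selection described above can be sketched for the graphic matroid as follows. This is a minimal illustration of Kruskal's algorithm with a union-find structure (a standard implementation choice, not something specified in the text); the graph and the weights are hypothetical.

```python
def max_weight_spanning_forest(n_vertices, edges, weights):
    """Greedy (Kruskal) selection: scan edges by decreasing weight, skip those closing a cycle."""
    parent = list(range(n_vertices))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for e in sorted(range(len(edges)), key=lambda e: -weights[e]):
        u, v = edges[e]
        ru, rv = find(u), find(v)
        if ru != rv:  # adding e keeps the selected set independent (acyclic)
            parent[ru] = rv
            chosen.append(e)
    return chosen


# Hypothetical graph: a triangle on {0, 1, 2} plus a pendant edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
weights = [3.0, 1.0, 2.0, 5.0]
tree = max_weight_spanning_forest(4, edges, weights)
print(sorted(tree), sum(weights[e] for e in tree))
```

The incidence vector of the returned edge set is then an extreme point of the base polyhedron of the rank function, under the assumptions stated above.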

Minimizing matroid rank function minus a modular function. General submodular functions may be minimized in polynomial time (see Section 7), but usually with large complexity, i.e., $O(p^6)$. For functions which are equal to the rank function of a matroid minus a modular function, dedicated algorithms have better running-time complexities, i.e.,

o(p3) [HiiiinH]. 



Footnote (spanning tree polytope). Note that algorithms presented in Section 6 lead to algorithms for several operations on this spanning tree polytope, such as line searches and orthogonal projections.

Properties of associated polyhedra



We now study in more detail the submodular and base polyhedra defined in Section [H as well as the symmetric independence polyhedron (which is the unit dual ball for the norms defined in Section 2.3). We first review that the support functions may be computed by the greedy algorithm, and then characterize the set of maximizers of linear functions, from which we deduce a detailed facial structure of the base polytope $B(F)$ and the symmetric independence polyhedron $|P|(F)$.

4.1 Support functions 

The next proposition completes Prop. 2.2 by computing the full support function of $B(F)$ and $P(F)$ (see [17, 18] for definitions of support functions), i.e., computing $\max_{s \in B(F)} w^\top s$ and $\max_{s \in P(F)} w^\top s$ for all possible w (with positive and/or negative coefficients). Note the different behaviors for $B(F)$ and $P(F)$.

Proposition 4.1. (Support functions of associated polyhedra) Let F be a submodular function such that $F(\varnothing) = 0$. We have:

(a) for all $w \in \mathbb{R}^p$, $\max_{s \in B(F)} w^\top s = f(w)$,

(b) if $w \in \mathbb{R}_+^p$, $\max_{s \in P(F)} w^\top s = f(w)$,

(c) if there exists j such that $w_j < 0$, then $\max_{s \in P(F)} w^\top s = +\infty$,

(d) if F is non-decreasing, for all $w \in \mathbb{R}^p$, $\max_{s \in |P|(F)} w^\top s = f(|w|)$.

Proof. The only statement left to prove beyond Prop. 2.2 and Prop. 2.3 is (c): we just need to notice that $s(\lambda) = s_0 - \lambda \delta_j \in P(F)$ for $\lambda \to +\infty$ and $s_0 \in P(F)$, and that $w^\top s(\lambda) \to +\infty$. □

The next proposition shows necessary and sufficient conditions for optimality in the definition of support functions. Note that Prop. 2.2 gave one example obtained from the greedy algorithm, and that we can
now characterize all maximizers. Moreover, note that the maximizer is 
unique only when w has distinct values, and otherwise, the ordering of 
the components of w is not unique, and hence, the greedy algorithm 
may have multiple outputs (and all convex combinations of these are 
also solutions). The following proposition essentially shows what is ex- 
actly needed to be a maximizer. This proposition is key to deriving 
optimality conditions for the separable optimization problems that we 
consider in Section [5] and Section [H 
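As a concrete illustration of the support function of $B(F)$, the sketch below runs the greedy algorithm on a small example and checks by brute force that its output lies in the base polyhedron. The function $F(A) = \sqrt{|A|}$ (a concave function of the cardinality, hence submodular) and the helper names are assumptions made for this example only.

```python
import itertools
import math


def base_vertex(F, order):
    """Extreme point of B(F) obtained by adding the elements in the given order."""
    s, prefix, prev = [0.0] * len(order), [], 0.0
    for k in order:
        prefix.append(k)
        cur = F(frozenset(prefix))
        s[k] = cur - prev  # marginal gain of k given the elements before it
        prev = cur
    return s


def greedy_maximizer(F, w):
    """Greedy algorithm: order the components of w decreasingly, then take marginal gains."""
    return base_vertex(F, sorted(range(len(w)), key=lambda k: -w[k]))


def in_base_polyhedron(F, s, tol=1e-9):
    """Brute-force membership test: s(A) <= F(A) for all A, and s(V) = F(V)."""
    p = len(s)
    for r in range(1, p + 1):
        for A in itertools.combinations(range(p), r):
            if sum(s[k] for k in A) > F(frozenset(A)) + tol:
                return False
    return abs(sum(s) - F(frozenset(range(p)))) <= tol


F = lambda A: math.sqrt(len(A))
w = [0.5, -1.0, 2.0]
s = greedy_maximizer(F, w)
f_w = sum(wk * sk for wk, sk in zip(w, s))  # value of the Lovasz extension at w
print(s, f_w, in_base_polyhedron(F, s))
```

Since a linear function on a polytope is maximized at an extreme point, comparing `f_w` with the values obtained from all orderings gives a simple sanity check of statement (a) of Prop. 4.1 on this example.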



Proposition 4.2. (Maximizers of the support function of submodular and base polyhedra) Let F be a submodular function such that $F(\varnothing) = 0$. Let $w \in \mathbb{R}^p$, with unique values $v_1 > \cdots > v_m$, taken at sets $A_1, \dots, A_m$ (i.e., $V = A_1 \cup \cdots \cup A_m$ and $\forall i \in \{1, \dots, m\}$, $\forall k \in A_i$, $w_k = v_i$). Then,

(a) if $w \in (\mathbb{R}_+^*)^p$, s is optimal for $\max_{s \in P(F)} w^\top s$ if and only if for all $i = 1, \dots, m$, $s(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$,

(b) s is optimal for $\max_{s \in B(F)} w^\top s$ if and only if for all $i = 1, \dots, m$, $s(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$.



Proof. We first prove (a). Let $B_i = A_1 \cup \cdots \cup A_i$, for $i = 1, \dots, m$. From the optimization problems defined in the proof of Prop. 2.2, let $\lambda_V = v_m > 0$, and $\lambda_{B_i} = v_i - v_{i+1} > 0$ for $i < m$, with all other $\lambda_A$, $A \subseteq V$, equal to zero. Such $\lambda$ is optimal (because the dual function is equal to the primal objective $f(w)$).

Let $s \in P(F)$. We have:

$$
\begin{aligned}
\sum_{A \subseteq V} \lambda_A F(A) &= v_m F(V) + \sum_{i=1}^{m-1} F(B_i)(v_i - v_{i+1}) \\
&= v_m (F(V) - s(V)) + \sum_{i=1}^{m-1} [F(B_i) - s(B_i)](v_i - v_{i+1}) + v_m s(V) + \sum_{i=1}^{m-1} s(B_i)(v_i - v_{i+1}) \\
&\geq v_m s(V) + \sum_{i=1}^{m-1} s(B_i)(v_i - v_{i+1}) = s^\top w.
\end{aligned}
$$

Thus s is optimal if and only if the primal objective value $s^\top w$ is equal to the optimal dual objective value $\sum_{A \subseteq V} \lambda_A F(A)$, and thus, if and only if there is equality in all above inequalities, hence the desired result. The proof for (b) follows the same arguments, except that we do not need to show that $s(V) = F(V)$, since this is always satisfied for $s \in B(F)$; hence we do not need $v_m > 0$. □

Note that for (a), if $v_m = 0$ in Prop. 4.2 (i.e., we take $w \in \mathbb{R}_+^p$ and there is a $w_k$ equal to zero), then the optimality condition is that for all $i = 1, \dots, m - 1$, $s(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$ (i.e., we do not need that $s(V) = F(V)$, i.e., the optimal solution is not necessarily in the base polyhedron).
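The condition of Prop. 4.2(b) is easy to test numerically when w has ties. In the sketch below (with the hypothetical choice $F(A) = \sqrt{|A|}$ and hypothetical helper names), two greedy orderings compatible with w give two different maximizers, their midpoint is also a maximizer, and an extreme point obtained from an incompatible ordering is not.

```python
import math


def base_vertex(F, order):
    """Extreme point of B(F) obtained by adding the elements in the given order."""
    s, prefix, prev = [0.0] * len(order), [], 0.0
    for k in order:
        prefix.append(k)
        cur = F(frozenset(prefix))
        s[k] = cur - prev
        prev = cur
    return s


def satisfies_prop_4_2b(F, w, s, tol=1e-9):
    """Check s(A_1 u ... u A_i) = F(A_1 u ... u A_i) on the level sets of w (s assumed in B(F))."""
    prefix = set()
    for v in sorted(set(w), reverse=True):
        prefix |= {k for k, wk in enumerate(w) if wk == v}
        if abs(sum(s[k] for k in prefix) - F(frozenset(prefix))) > tol:
            return False
    return True


F = lambda A: math.sqrt(len(A))
w = [1.0, 1.0, 0.0]                      # level sets A_1 = {0, 1}, A_2 = {2}
s1 = base_vertex(F, [0, 1, 2])           # ordering compatible with w
s2 = base_vertex(F, [1, 0, 2])           # another compatible ordering
mid = [(a + b) / 2 for a, b in zip(s1, s2)]
s3 = base_vertex(F, [2, 0, 1])           # ordering not compatible with w
print([satisfies_prop_4_2b(F, w, s) for s in (s1, s2, mid, s3)])
```

This mirrors the remark that the maximizer is unique only when w has distinct values: here the whole segment between `s1` and `s2` is optimal.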

4.2 Facial structure 

In this section, we describe the facial structure of the base polyhedron. 
We first review the relevant concepts for convex polytopes. 

Face lattice of a convex polytope. We quickly review the main concepts related to convex polytopes. For more details, see [56]. A convex polytope is the convex hull of a finite number of points. It may also be seen as the intersection of finitely many half-spaces (such intersections are referred to as polyhedra and are called polytopes if they are bounded). Faces of a polytope are sets of maximizers of $w^\top s$ for certain $w \in \mathbb{R}^p$. Faces are convex sets whose affine hulls are intersections of the hyperplanes defining the half-spaces from the intersection of half-space representation. The dimension of a face is the dimension of its affine hull. The $(p - 1)$-dimensional faces are often referred to as facets, while zero-dimensional faces are its vertices. A natural order may be defined on the set of faces, namely the inclusion order between the sets of hyperplanes defining the face. With this order, the set of faces is a distributive lattice [38], with appropriate notions of "join" (unique smallest face that contains the two faces) and "meet" (intersection of the two faces).

Dual polytope. We now assume that we consider a polytope with zero in its interior (this can be done by projecting it onto its affine hull and translating it appropriately). The dual polytope of C is the polar set $C^\circ$ of the polytope C (see Appendix A). It turns out that faces of $C^\circ$ are in bijection with the faces of C, with vertices of C mapped to facets of $C^\circ$ and vice-versa. If C is represented as the convex hull of points $s_i$, $i \in \{1, \dots, m\}$, then the polar of C is defined through the intersection of the half-spaces $\{w \in \mathbb{R}^p,\ s_i^\top w \leq 1\}$, for $i = 1, \dots, m$. Analyses and algorithms related to polytopes may always be defined or looked at through their dual polytopes. In our situation, we consider two polytopes: $B(F)$, for which the dual polytope is the set $\{w,\ f(w) \leq 1,\ w^\top 1_V = 0\}$ (see an example in Figure 2.2), and the symmetric independence polytope $|P|(F)$, whose dual polytope is the unit ball of the norm $\Omega$ defined in Section 2.3. See Figure 2.3 for examples of these polytopes, and also Section 4.3.

Faces of the base polyhedron. Given Prop. 4.2, which provides the maximizers of $\max_{s \in B(F)} w^\top s$, we may now give necessary and sufficient conditions for characterizing faces of the base polyhedron. We first characterize when the base polyhedron $B(F)$ has non-empty interior within the subspace $\{s(V) = F(V)\}$.

Definition 4.1. (Inseparable set) Let F be a submodular function such that $F(\varnothing) = 0$. A set $A \subseteq V$ is said separable if and only if there is a set $B \subseteq A$, such that $B \neq \varnothing$, $B \neq A$ and $F(A) = F(B) + F(A \setminus B)$. If A is not separable, A is said inseparable.



Proposition 4.3. (Full-dimensional base polyhedron) Let F be a submodular function such that $F(\varnothing) = 0$. The base polyhedron has non-empty interior in $\{s(V) = F(V)\}$ if and only if V is not separable.

Proof. If V is separable into A and $V \setminus A$, then, for all $s \in B(F)$, we must have $s(A) = F(A)$ and $s(V \setminus A) = F(V \setminus A)$ (because $s(A) \leq F(A)$, $s(V \setminus A) \leq F(V \setminus A)$, and both sides sum to $s(V) = F(V) = F(A) + F(V \setminus A)$), and hence the base polyhedron is included in the intersection of two affine hyperplanes, i.e., $B(F)$ does not have non-empty interior in $\{s(V) = F(V)\}$.

Since $B(F)$ is defined through supporting hyperplanes, it has non-empty interior in $\{s(V) = F(V)\}$ if it is not contained in any of the supporting hyperplanes. We thus now assume that $B(F)$ is included in $\{s(A) = F(A)\}$, for A a non-empty strict subset of V. Then $B(F)$ can be factorized into $B(F_A) \times B(F^A)$ where $F_A$ is the restriction of F to A and $F^A$ the contraction of F on A (see definition and properties in Appendix B.2). Indeed, if $s \in B(F)$, then $s_A \in B(F_A)$ because $s(A) = F(A)$, and $s_{V \setminus A} \in B(F^A)$, because for $B \subseteq V \setminus A$, $s_{V \setminus A}(B) = s(B) = s(A \cup B) - s(A) \leq F(A \cup B) - F(A)$. Similarly, if $s \in B(F_A) \times B(F^A)$, then for all sets $B \subseteq V$, $s(B) = s(A \cap B) + s((V \setminus A) \cap B) \leq F(A \cap B) + F(A \cup B) - F(A) \leq F(B)$ by submodularity, and $s(A) = F(A)$.

This shows that $f(w) = f_A(w_A) + f^A(w_{V \setminus A})$, which implies that $F(V) = F(A) + F(V \setminus A)$ when applied to $w = 1_{V \setminus A}$ (for which $f(w) = F(V \setminus A)$, $f_A(w_A) = 0$ and $f^A(w_{V \setminus A}) = F(V) - F(A)$), i.e., V is separable.

□
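Separability in the sense of Definition 4.1 can be tested by enumeration on small ground sets. The sketch below is an illustration under stated assumptions: it uses the cut function of a small undirected graph (a hypothetical example), for which the whole vertex set is separable exactly when the graph is disconnected.

```python
import itertools


def is_separable(F, A, tol=1e-12):
    """True if F(A) = F(B) + F(A minus B) for some non-empty strict subset B of A."""
    A = frozenset(A)
    items = sorted(A)
    for r in range(1, len(items)):
        for B in itertools.combinations(items, r):
            B = frozenset(B)
            if abs(F(A) - F(B) - F(A - B)) <= tol:
                return True
    return False


def cut_function(edges):
    """Cut function of an undirected graph: number of edges with exactly one endpoint in A."""
    def F(A):
        A = frozenset(A)
        return float(sum(1 for u, v in edges if (u in A) != (v in A)))
    return F


path = cut_function([(0, 1), (1, 2)])        # connected graph on {0, 1, 2}
two_parts = cut_function([(0, 1), (2, 3)])   # two connected components
print(is_separable(path, {0, 1, 2}), is_separable(two_parts, {0, 1, 2, 3}))
```

By Prop. 4.3, the base polyhedron of the first function is then full-dimensional within $\{s(V) = F(V)\}$, while the second one is not.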

We can now detail the facial structure of the base polyhedron, which will be dual to the one of the polyhedron defined by $\{w \in \mathbb{R}^p,\ f(w) \leq 1,\ w^\top 1_V = 0\}$ (i.e., the sub-level set of the Lovász extension projected on a subspace of dimension $p - 1$). As the base polyhedron $B(F)$ is a polytope in dimension $p - 1$ (because it is bounded and contained in the affine hyperplane $\{s(V) = F(V)\}$), one can define a set of faces. As described earlier, faces are the intersections of the polyhedron $B(F)$ with any of its supporting hyperplanes. Supporting hyperplanes are themselves defined as the hyperplanes $\{s(A) = F(A)\}$ for $A \subseteq V$. From Prop. 4.2, faces are obtained as the intersection of $B(F)$ with $s(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$ for an ordered partition of V. Together with Prop. 4.3, we can now provide a characterization of the faces of $B(F)$. See more details on the facial structure of $B(F)$ in [39].

Since the facial structure is invariant by translation, we may translate $B(F)$ by a certain vector $t \in B(F)$, so that F may be taken to be non-negative and such that $F(V) = 0$ (as done at the end of Section 2.1), which we now assume.

Proposition 4.4. (Faces of the base polyhedron) Let $A_1 \cup \cdots \cup A_m$ be an ordered partition of V, such that for all $j \in \{1, \dots, m\}$, $A_j$ is inseparable for the function $G_j : B \mapsto F(A_1 \cup \cdots \cup A_{j-1} \cup B) - F(A_1 \cup \cdots \cup A_{j-1})$ defined on subsets of $A_j$. Then the set of bases $s \in B(F)$ such that for all $j \in \{1, \dots, m\}$, $s(A_1 \cup \cdots \cup A_j) = F(A_1 \cup \cdots \cup A_j)$ is a face of $B(F)$ with non-empty interior in the intersection of the m hyperplanes (i.e., the affine hull of the face is exactly the intersection of these m hyperplanes). Moreover, all faces of $B(F)$ may be obtained this way.

Proof. From Prop. 4.2, all faces may be obtained with supporting hyperplanes of the form $s(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$, $i = 1, \dots, m$, for a certain partition $V = A_1 \cup \cdots \cup A_m$. However, among these partitions, only some of them will lead to an affine hull of full dimension m. From Prop. 4.3 applied to the submodular function $G_j$, this only happens if $G_j$ has no separable sets. □

Note that in the previous proposition, several ordered partitions may lead to the exact same face. The maximal number of full-dimensional faces of $B(F)$ is always less than $2^p - 2$ (number of non-trivial subsets of V), but this number may be reduced in general (see examples in Figure 2.2). Moreover, the number of extreme points may also be large, e.g., $p!$ for the submodular function $A \mapsto -|A|^2$ (leading to the
permutohedron [i9]). 

Note that the previous discussion implies that we also have a characterization of the faces of the dual polytope $U = \{w \in \mathbb{R}^p,\ f(w) \leq 1,\ w^\top 1_V = 0\}$ (note that because we have assumed that F is non-negative and $F(V) = 0$, then f is pointwise non-negative and satisfies $f(1_V) = 0$). In particular, the faces of U are obtained from the faces of $B(F)$ through the relationship defined in Prop. 4.2: that is, given a face of $B(F)$, and all the ordered partitions of Prop. 4.4 which lead to it, the corresponding face of U is the closure of the union of all w that satisfy the level set constraints imposed by the different ordered partitions. As shown in [7], the different ordered partitions all share the same elements but with a different order, thus inducing a set of partial constraints between the ordering of the m values w is allowed to take.

An important aspect is that the separability criterion in Prop. 4.4 forbids some level sets from being characteristic of a face. For example, for cuts in an undirected graph, this shows that all level sets within a face must be connected components of the graph. When the Lovász extension is used as a constraint for a smooth optimization problem, the solution has to lie in one of the faces. Moreover, within this face, all other affine constraints are very unlikely to be active, unless the smooth function has some specific directions of zero gradient (unlikely with random data; for some sharper statements, see [7]). Thus, when using the Lovász extension as a regularizer, only certain level sets are likely to occur, and in the context of cut functions, only connected sets are allowed, which is one of the justifications behind using the total variation.

4.3 Symmetric independence polyhedron 

We now assume that the function F is non-decreasing, and consider the symmetric independence polyhedron $|P|(F)$, which is the unit ball of the dual norm $\Omega^*$ defined in Section 2.3. This polytope is dual to the unit ball of $\Omega$, and it is thus of interest to characterize the facial structure of $|P|(F)$. We need the additional notion of stable sets.

Definition 4.2. (Stable sets) A set $A \subseteq V$ is said stable for a submodular function F if $A \subseteq B$ and $A \neq B$ implies that $F(A) < F(B)$.
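Stable sets can be enumerated by brute force on small examples. The sketch below assumes that F is non-decreasing, in which case it is enough to test single-element extensions: if $F(A \cup \{k\}) > F(A)$ for every $k \notin A$, then any strict superset B of A satisfies $F(B) \geq F(A \cup \{k\}) > F(A)$ for some $k \in B \setminus A$. The example function $F(A) = \min(|A|, 2)$ is a hypothetical choice.

```python
import itertools


def stable_sets(F, p, tol=1e-12):
    """Enumerate the stable sets of a non-decreasing set-function F on {0, ..., p-1}."""
    V = frozenset(range(p))
    result = []
    for r in range(p + 1):
        for A in itertools.combinations(range(p), r):
            A = frozenset(A)
            # For non-decreasing F, checking single-element extensions is sufficient.
            if all(F(A | {k}) > F(A) + tol for k in V - A):
                result.append(A)
    return result


F = lambda A: float(min(len(A), 2))
stable = stable_sets(F, 3)
print(sorted(sorted(A) for A in stable))
```

On this example the pairs are not stable (adding the third element does not increase F), while the empty set, the singletons and the full set are.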



We first derive the same proposition as Prop. 4.2 for the symmetric independence polyhedron.






Proposition 4.5. (Maximizers of the support function of symmetric independence polyhedron) Let F be a non-decreasing submodular function such that $F(\varnothing) = 0$. Let $w \in \mathbb{R}^p$, with unique values for $|w|$, $v_1 > \cdots > v_m > 0$, taken at sets $A_1, \dots, A_m$. Then s is optimal for $\max_{s \in |P|(F)} w^\top s$ if and only if for all $i = 1, \dots, m$, $|s|(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$, and w and s have the same signs.

Proof. The proof follows the same arguments as for Prop. 4.2. □

Note that in the previous proposition, if $v_m = 0$ (i.e., we take $w \in \mathbb{R}^p$ with some zero components), then the optimality condition is that for all $i = 1, \dots, m - 1$, $|s|(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$ (i.e., we do not need that $|s|(V) = F(V)$, that is, the optimal solution is not necessarily in the base polyhedron). Moreover, the value of $s_k$ when $w_k = 0$ is irrelevant (given that $|s| \in P(F)$).

We can now derive a characterization of the faces of $|P|(F)$.



Proposition 4.6. (Faces of the symmetric independence polyhedron)
Let $C$ be a stable set and let $A_1 \cup \cdots \cup A_m$ be an ordered
partition of $C$, such that for all $j \in \{1, \dots, m\}$, $A_j$ is inseparable for the
function $G_j : B \mapsto F(A_1 \cup \cdots \cup A_{j-1} \cup B) - F(A_1 \cup \cdots \cup A_{j-1})$ defined
on subsets of $A_j$, and $\varepsilon \in \{-1, 1\}^C$. Then the set of bases $s \in B(F)$
such that for all $j \in \{1, \dots, m\}$, $(\varepsilon \circ s)(A_1 \cup \cdots \cup A_j) = F(A_1 \cup \cdots \cup A_j)$
is a face of $|P|(F)$ with non-empty interior in the intersection of the $m$
hyperplanes. Moreover, all faces of $|P|(F)$ may be obtained this way.



Proof. The proof follows the same structure as for Prop. 4.4, but by
applying Prop. 4.5. The requirement for stability comes from the fact
that if $C$ is not stable, then if $D$ is a larger set such that $F(D) = F(C)$,
we have the additional constraint $(s \circ \varepsilon)(D \setminus C) = 0$. □

The last proposition has interesting consequences for the use of
submodular functions for defining sparsity-inducing norms. Indeed, the
faces of the unit ball of $\Omega$ are dual to the ones of the unit ball of the
dual norm $\Omega^*$ (which is exactly $|P|(F)$). Moreover, as a consequence of
Prop. 4.5, the set $C$ in Prop. 4.6 corresponds to the non-zero elements
of $w$ in a face of the unit ball of $\Omega$. This implies that all faces of the
unit ball of $\Omega$ will only impose non-zero patterns which are stable sets.
Note here the relationship with $w \mapsto F(\mathrm{Supp}(w))$, which would share the
same property; that is, when this function is used to regularize a continuous
objective function, then a stable set is always a solution of the problem,
as augmenting unstable sets does not increase the value of $F$, but can
only decrease the minimal value of the continuous objective function
because of an extra variable to optimize upon.

However, the faces of $|P|(F)$ are not all related to non-zero patterns,
and, as before, and as shown in Figure [2^ there are additional
singularities, which may come as desired or undesired (see [115]).

Stable inseparable sets. We end the description of the structure
of $|P|(F)$ by noting that among the $2^p - 1$ constraints of the form
$\|s_A\|_1 \leq F(A)$ defining it, we may restrict the sets $A$ to be stable and
inseparable. Indeed, if $\|s_A\|_1 \leq F(A)$ for all stable and inseparable sets
$A$, then if $B$ is not stable, we may consider the smallest enclosing
stable set $C$ (stable sets are closed under intersection, hence the
possibility of defining such a smallest enclosing stable set), and we have
$\|s_B\|_1 \leq \|s_C\|_1$ and $F(B) = F(C)$. We thus need to show that
$\|s_C\|_1 \leq F(C)$ only for stable sets. If the set $C$ is separable into
$C = D_1 \cup \cdots \cup D_m$, where all $D_i$, $i = 1, \dots, m$, are inseparable, they
must all be stable (otherwise $C$ would not be), and thus we have
$\|s_C\|_1 = \|s_{D_1}\|_1 + \cdots + \|s_{D_m}\|_1 \leq F(D_1) + \cdots + F(D_m) = F(C)$.

In this section, we consider separable convex functions and the
minimization of such functions penalized by the Lovász extension of a
submodular function. When the separable functions are all quadratic
functions, those problems are often referred to as proximal problems and
are often used as inner loops in convex optimization problems regularized
by the Lovász extension (see a brief introduction in Section 5.1
and, e.g., [28, 8] and references therein). In this section, we consider
relationships between separable optimization problems and general
submodular minimization problems, and focus on a detailed analysis of the
equivalence between these; for corresponding algorithms, see Section 6.

5.1 Convex optimization with proximal methods 

In this section, we briefly review proximal methods, which are convex
optimization methods particularly suited to the norms we have defined.
They essentially allow one to solve the problem regularized with a new
norm at low implementation and computational costs. For a more complete
presentation of optimization techniques adapted to sparsity-inducing
norms, see [8]. Proximal-gradient methods constitute a class of
first-order techniques typically designed to solve problems of the following
form [113, 11, 28]:

$$\min_{w \in \mathbb{R}^p} g(w) + h(w), \qquad (5.1)$$

where $g$ is smooth. They take advantage of the structure of Eq. (5.1)
as the sum of two convex terms, only one of which is assumed smooth.
Thus, we will typically assume that $g$ is differentiable (and in our
situation in Eq. (2.6), that the loss function $\ell$ is convex and differentiable),
with Lipschitz-continuous gradients (such as the logistic or square loss),
while $h$ will only be assumed convex.

Proximal methods have become increasingly popular over the past
few years, both in the signal processing (see, e.g., [12, 137, 28] and
numerous references therein) and in the machine learning communities
(see, e.g., [8] and references therein). In a broad sense, these methods
can be described as providing a natural extension of gradient-based
techniques when the objective function to minimize has a non-smooth
part. Proximal methods are iterative procedures. Their basic principle
is to linearize, at each iteration, the function $g$ around the current
estimate $\bar{w}$, and to update this estimate as the (unique, by strong
convexity) solution of the following proximal problem, in which the
non-smooth term is weighted by a regularization parameter $\lambda \geq 0$:

$$\min_{w \in \mathbb{R}^p} \; g(\bar{w}) + (w - \bar{w})^\top g'(\bar{w}) + \lambda h(w) + \frac{L}{2} \|w - \bar{w}\|_2^2. \qquad (5.2)$$

The role of the added quadratic term is to keep the update in a
neighborhood of $\bar{w}$ where $g$ stays close to its current linear approximation;
$L > 0$ is a parameter which is an upper bound on the Lipschitz constant
of the gradient $g'$.

Provided that we can solve efficiently the proximal problem in
Eq. (5.2), this first iterative scheme constitutes a simple way of solving
the problem in Eq. (5.1). It appears under various names in the
literature: proximal-gradient techniques [113], forward-backward splitting
methods [28], and iterative shrinkage-thresholding algorithm. Furthermore,
it is possible to guarantee convergence rates for the function
values [113, 11], and after $t$ iterations, the precision can be shown to be
of order $O(1/t)$, which should be contrasted with rates for the subgradient
case, which are rather $O(1/\sqrt{t})$.

This first iterative scheme can actually be extended to "accelerated"
versions [113, 11]. In that case, the update is not taken to be
exactly the result from Eq. (5.2); instead, it is obtained as the solution
of the proximal problem applied to a well-chosen linear combination
of the previous estimates. In that case, the function values converge
to the optimum with a rate of $O(1/t^2)$, where $t$ is the iteration number.
From [112], we know that this rate is optimal within the class
of first-order techniques; in other words, accelerated proximal-gradient
methods can be as fast as if there were no non-smooth component.

We have so far given an overview of proximal methods, without
specifying how we precisely handle their core part, namely the
computation of the proximal problem, as defined in Eq. (5.2).

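Proximal Problem. We first rewrite the problem in Eq. (5.2) as

$$\min_{w \in \mathbb{R}^p} \; \frac{1}{2} \Big\| w - \Big( \bar{w} - \frac{1}{L} g'(\bar{w}) \Big) \Big\|_2^2 + \frac{\lambda}{L} h(w),$$

where $g'(\bar{w})$ is the gradient of the smooth term at the current estimate $\bar{w}$.
Under this form, we can readily observe that when $\lambda = 0$, the solution
of the proximal problem is identical to the standard gradient update
rule. The problem above can be more generally viewed as an instance
of the proximal operator [100] associated with $\lambda h$:

$$\mathrm{Prox}_{\lambda h} : u \in \mathbb{R}^p \mapsto \arg\min_{v \in \mathbb{R}^p} \; \frac{1}{2} \|u - v\|_2^2 + \lambda h(v).$$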

For many choices of regularizers $h$, the proximal problem has a
closed-form solution, which makes proximal methods particularly efficient.
If $h$ is chosen to be the $\ell_1$-norm, the proximal operator is simply
the soft-thresholding operator applied elementwise [32]. In this paper,
the function $h$ will be either the Lovász extension $f$ of the submodular
function $F$, or, for non-decreasing submodular functions, the norm $\Omega$
defined in Section 2.3. In both cases, the proximal operator is exactly
one of the separable optimization problems we consider in this section.
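
To make the update rule concrete, here is a minimal Python sketch of the proximal-gradient iteration, assuming the non-smooth term is the $\ell_1$-norm so that the proximal operator is elementwise soft-thresholding. The function names (`soft_threshold`, `proximal_gradient`) and the toy data `z` are illustrative choices, not taken from the text.

```python
def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1: elementwise soft-thresholding."""
    return [max(abs(x) - tau, 0.0) * (1.0 if x > 0 else -1.0) for x in u]


def proximal_gradient(grad_g, prox, w0, L, lam, n_iter=100):
    """Iterate w <- Prox_{(lam/L) h}(w - grad_g(w) / L).

    grad_g: gradient of the smooth term; prox(u, tau): proximal operator
    of tau * h; L: upper bound on the Lipschitz constant of grad_g.
    """
    w = list(w0)
    for _ in range(n_iter):
        g = grad_g(w)
        w = prox([wi - gi / L for wi, gi in zip(w, g)], lam / L)
    return w


# Toy problem (hypothetical data): smooth term 0.5 * ||w - z||^2, h = l1-norm.
z = [3.0, -0.5, 1.0]
w_hat = proximal_gradient(lambda w: [wi - zi for wi, zi in zip(w, z)],
                          soft_threshold, [0.0, 0.0, 0.0], L=1.0, lam=1.0)
```

On this toy problem the smooth term is itself quadratic with $L = 1$, so a single step already reaches the fixed point, which is the soft-thresholded vector $(2, 0, 0)$.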

5.2 Optimality conditions for base polyhedra 

Throughout this section, we make the simplifying assumption that
the problem is strictly convex and differentiable (but not necessarily
quadratic) and such that the derivatives are unbounded, but sharp
statements could also be made in the general case. The next proposition
shows that by convex strong duality (see Appendix A), it is
equivalent to the maximization of a separable concave function over
the base polyhedron.



Proposition 5.1. (Dual of proximal optimization problem)
Let $\psi_1, \dots, \psi_p$ be $p$ continuously differentiable strictly convex
functions on $\mathbb{R}$ such that for all $j \in V$,
$\sup_{\alpha \in \mathbb{R}} \psi_j'(\alpha) = +\infty$ and $\inf_{\alpha \in \mathbb{R}} \psi_j'(\alpha) = -\infty$.
Denote $\psi_1^*, \dots, \psi_p^*$ their Fenchel-conjugates (which then have full
domain). The two following optimization problems are dual of each other:

$$\min_{w \in \mathbb{R}^p} \; f(w) + \sum_{j=1}^p \psi_j(w_j), \qquad (5.3)$$

$$\max_{s \in B(F)} \; - \sum_{j=1}^p \psi_j^*(-s_j). \qquad (5.4)$$

The pair $(w, s)$ is optimal if and only if (a) $s_k = -\psi_k'(w_k)$ for all
$k \in \{1, \dots, p\}$, and (b) $s \in B(F)$ is optimal for the maximization of
$w^\top s$ over $s \in B(F)$ (see Prop. 4.2 for optimality conditions).

Proof. We have assumed that for all $j \in V$, the functions $\psi_j$ are such
that $\sup_{\alpha \in \mathbb{R}} \psi_j'(\alpha) = +\infty$ and $\inf_{\alpha \in \mathbb{R}} \psi_j'(\alpha) = -\infty$. This implies that
the Fenchel-conjugates $\psi_j^*$ (which are already differentiable because of
the strict convexity of $\psi_j$ [16]) are defined and finite on $\mathbb{R}$, as well
as strictly convex. We have (since strong duality applies because of
Fenchel duality, see Appendix A.2 and [16]):

$$\begin{aligned}
\min_{w \in \mathbb{R}^p} f(w) + \sum_{j=1}^p \psi_j(w_j)
&= \min_{w \in \mathbb{R}^p} \max_{s \in B(F)} \; w^\top s + \sum_{j=1}^p \psi_j(w_j) \\
&= \max_{s \in B(F)} \min_{w \in \mathbb{R}^p} \; w^\top s + \sum_{j=1}^p \psi_j(w_j) \\
&= \max_{s \in B(F)} \; - \sum_{j=1}^p \psi_j^*(-s_j),
\end{aligned}$$

where $\psi_j^*$ is the Fenchel-conjugate of $\psi_j$ (which may in general have a
domain strictly included in $\mathbb{R}$). Thus the separably penalized problem
defined in Eq. (5.3) is equivalent to a separable maximization over the
base polyhedron (i.e., Eq. (5.4)). Moreover, the unique optimal $s$ for
Eq. (5.4) and the unique optimal $w$ for Eq. (5.3) are related through
$s_j = -\psi_j'(w_j)$ for all $j \in V$. □

5.3 Equivalence with submodular function minimization 

Following [21], we also consider a sequence of set optimization problems,
parameterized by $\alpha \in \mathbb{R}$:

$$\min_{A \subset V} \; F(A) + \sum_{j \in A} \psi_j'(\alpha). \qquad (5.5)$$

We denote by $A^\alpha$ any minimizer of Eq. (5.5). Note that $A^\alpha$ is a
minimizer of a submodular function $F + \psi'(\alpha)$, where $\psi'(\alpha) \in \mathbb{R}^p$ is the
vector of components $\psi_k'(\alpha)$, $k \in \{1, \dots, p\}$.

The key property we highlight in this section is that, as shown
in [21], solving Eq. (5.3), which is a convex optimization problem, is
equivalent to solving Eq. (5.5) for all possible $\alpha \in \mathbb{R}$, which are
submodular optimization problems. We first show a monotonicity property
of solutions of Eq. (5.5) (following [21]).

Proposition 5.2. (Monotonicity of solutions) Under the same
assumptions as in Prop. 5.1, if $\alpha < \beta$, then any solutions $A^\alpha$ and $A^\beta$
of Eq. (5.5) for $\alpha$ and $\beta$ satisfy $A^\beta \subset A^\alpha$.

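Proof. We have, by optimality of $A^\alpha$ and $A^\beta$:

$$F(A^\alpha) + \sum_{j \in A^\alpha} \psi_j'(\alpha) \leq F(A^\alpha \cup A^\beta) + \sum_{j \in A^\alpha \cup A^\beta} \psi_j'(\alpha),$$

$$F(A^\beta) + \sum_{j \in A^\beta} \psi_j'(\beta) \leq F(A^\alpha \cap A^\beta) + \sum_{j \in A^\alpha \cap A^\beta} \psi_j'(\beta),$$

and by summing the two inequalities and using the submodularity of $F$,

$$\sum_{j \in A^\alpha} \psi_j'(\alpha) + \sum_{j \in A^\beta} \psi_j'(\beta) \leq \sum_{j \in A^\alpha \cup A^\beta} \psi_j'(\alpha) + \sum_{j \in A^\alpha \cap A^\beta} \psi_j'(\beta),$$

which is equivalent to $\sum_{j \in A^\beta \setminus A^\alpha} (\psi_j'(\beta) - \psi_j'(\alpha)) \leq 0$, which implies,
since for all $j \in V$, $\psi_j'(\beta) > \psi_j'(\alpha)$ (because of strict convexity), that
$A^\beta \setminus A^\alpha = \emptyset$. □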



The next proposition shows that we can obtain the unique solution
of Eq. (5.3) from all solutions of Eq. (5.5).



Proposition 5.3. (Proximal problem from submodular function
minimizations) Under the same assumptions as in Prop. 5.1,
given any solutions $A^\alpha$ of problems in Eq. (5.5), for all $\alpha \in \mathbb{R}$, we define
the vector $u \in \mathbb{R}^p$ as

$$u_j = \sup(\{\alpha \in \mathbb{R}, \; j \in A^\alpha\}).$$

Then $u$ is the unique solution of the convex optimization problem in
Eq. (5.3).



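Proof. Because $\inf_{\alpha \in \mathbb{R}} \psi_j'(\alpha) = -\infty$, for $\alpha$ small enough, we must have
$A^\alpha = V$, and thus $u_j$ is well-defined and finite for all $j \in V$.

If $\alpha > u_j$, then, by definition of $u_j$, $j \notin A^\alpha$. This implies that
$A^\alpha \subset \{j \in V, \; u_j \geq \alpha\} = \{u \geq \alpha\}$. Moreover, if $u_j > \alpha$, there exists
$\beta \in (\alpha, u_j)$ such that $j \in A^\beta$. By the monotonicity property of Prop. 5.2,
$A^\beta$ is included in $A^\alpha$. This implies $\{u > \alpha\} \subset A^\alpha$.

We have for all $w \in \mathbb{R}^p$, and $\beta$ less than the smallest of $(w_j)_-$ and
the smallest of $(u_j)_-$, $j \in V$:

$$\begin{aligned}
f(u) + \sum_{j=1}^p \psi_j(u_j)
&= \int_0^{\infty} F(\{u \geq \alpha\}) \, d\alpha + \int_\beta^0 \big( F(\{u \geq \alpha\}) - F(V) \big) \, d\alpha
 + \sum_{j=1}^p \Big\{ \int_\beta^{u_j} \psi_j'(\alpha) \, d\alpha + \psi_j(\beta) \Big\} \\
&= C + \int_\beta^{\infty} \Big[ F(\{u \geq \alpha\}) + \sum_{j=1}^p (1_{u \geq \alpha})_j \, \psi_j'(\alpha) \Big] d\alpha
 \quad \text{with } C = \int_0^\beta F(V) \, d\alpha + \sum_{j=1}^p \psi_j(\beta) \\
&\leq C + \int_\beta^{\infty} \Big[ F(\{w \geq \alpha\}) + \sum_{j=1}^p (1_{w \geq \alpha})_j \, \psi_j'(\alpha) \Big] d\alpha
 \quad \text{by optimality of } A^\alpha \\
&= f(w) + \sum_{j=1}^p \psi_j(w_j),
\end{aligned}$$

where the inequality uses that $A^\alpha$ coincides with $\{u \geq \alpha\}$ for all $\alpha$
except the finitely many values taken by the components of $u$.
This shows that $u$ is the unique optimum of the problem in Eq. (5.3). □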

From the previous proposition, we also get the following corollary,
i.e., all solutions of Eq. (5.5) may be obtained from the unique solution
of Eq. (5.3). Note that we immediately get the maximal and minimal
minimizers, but that there is no general characterization of the set of 
minimizers (which is a lattice because of Prop. [TTT]) . 

Proposition 5.4. (Submodular function minimizations from
proximal problem) Under the same assumptions as in Prop. 5.1, if
$u$ is the unique minimizer of Eq. (5.3), then for all $\alpha \in \mathbb{R}$, the minimal
minimizer of Eq. (5.5) is $\{u > \alpha\}$ and the maximal minimizer is
$\{u \geq \alpha\}$, that is, for any minimizer $A^\alpha$, we have $\{u > \alpha\} \subset A^\alpha \subset \{u \geq \alpha\}$.

Proof. From the definition of the supremum in Prop. 5.3, we
immediately obtain that $\{u > \alpha\} \subset A^\alpha \subset \{u \geq \alpha\}$ for any minimizer
$A^\alpha$. Moreover, if $\alpha$ is not a value taken by some $u_j$, $j \in V$, then this
defines $A^\alpha$ uniquely. If not, then we simply need to show that $\{u \geq \alpha\}$
and $\{u > \alpha\}$ are indeed minimizers, which can be obtained by taking
limits of $A^\beta$ when $\beta$ tends to $\alpha$ from below and above. □
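
The correspondence between the level sets of $u$ and the minimizers $A^\alpha$ can be checked by brute force on a tiny instance. The sketch below is an illustration under stated assumptions, not an algorithm from the text: it takes $\psi_j(w_j) = \frac{1}{2}(w_j - z_j)^2$, so that $\psi_j'(\alpha) = \alpha - z_j$, enumerates all subsets (only feasible for very small $p$), and recovers $u_j$ as the largest $\alpha$ for which $j$ belongs to the maximal minimizer. The names `prox_by_level_sets` and `cut_chain` are hypothetical.

```python
from itertools import combinations


def all_subsets(p):
    """All subsets of {0, ..., p-1} as frozensets."""
    return [frozenset(c) for r in range(p + 1)
            for c in combinations(range(p), r)]


def prox_by_level_sets(F, z, tol=1e-9):
    """Brute-force u_j = sup{alpha : j in A^alpha} for psi_j(w) = (w - z_j)^2 / 2.

    With this choice the set problem reads min_A F(A) - z(A) + alpha * |A|.
    """
    p = len(z)
    sets = all_subsets(p)
    G = {A: F(A) - sum(z[k] for k in A) for A in sets}
    # The minimizer can only change at values of alpha where two sets of
    # different cardinality reach the same objective value.
    candidates = sorted({(G[B] - G[A]) / (len(A) - len(B))
                         for A in sets for B in sets if len(A) > len(B)})
    u = [None] * p
    for alpha in candidates:  # increasing order: the last assignment wins
        values = {A: G[A] + alpha * len(A) for A in sets}
        best = min(values.values())
        maximal = max((A for A in sets if values[A] <= best + tol), key=len)
        for j in maximal:
            u[j] = alpha
    return u


def cut_chain(A, p=2):
    """Cut function of the chain graph 0 - 1 - ... - (p-1), unit weights."""
    return float(sum((k in A) != (k + 1 in A) for k in range(p - 1)))


u = prox_by_level_sets(cut_chain, [3.0, 0.0])
```

For this two-node chain and $z = (3, 0)$, the sketch returns $u = (2, 1)$, which is also the minimizer of $|w_1 - w_2| + \frac{1}{2}\|w - z\|_2^2$; for instance, at $\alpha = 1.5$ the only minimizer of the set problem is $\{u \geq 1.5\} = \{u > 1.5\}$, the first node alone.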



Duality gap. We can further show that for any $s \in B(F)$ and $w \in \mathbb{R}^p$:

$$f(w) - w^\top s + \sum_{j=1}^p \big\{ \psi_j(w_j) + \psi_j^*(-s_j) + w_j s_j \big\} \qquad (5.6)$$
$$= \int_{-\infty}^{+\infty} \big\{ (F + \psi'(\alpha))(\{w \geq \alpha\}) - (s + \psi'(\alpha))_-(V) \big\} \, d\alpha.$$

Thus, the duality gap of the separable optimization problem in
Prop. 5.1 may be written as the integral of a function of $\alpha$. It turns
out that, as a consequence of Prop. 7.3 (Section 7), this function of
$\alpha$ is the duality gap for the minimization of the submodular function
$F + \psi'(\alpha)$. Thus, we obtain another direct proof of the previous
propositions. Eq. (5.6) will be particularly useful when relating approximate
solutions of the convex optimization problem to approximate solutions
of the combinatorial optimization problem of minimizing a submodular
function (see Section 7.3).

5.4 Quadratic optimization problems 

When specializing Prop. 5.1 and 5.4 to quadratic functions, we obtain
the following corollary, which shows how to obtain minimizers of
$F(A) + \lambda |A|$ for all possible $\lambda \in \mathbb{R}$ from a single convex optimization problem:

Proposition 5.5. (Quadratic optimization problem) Let $F$ be
a submodular function and $w \in \mathbb{R}^p$ the unique minimizer of
$w \mapsto f(w) + \frac{1}{2}\|w\|_2^2$. Then:

(a) $s = -w$ is the point in $B(F)$ with minimum $\ell_2$-norm,

(b) For all $\lambda \in \mathbb{R}$, the maximal minimizer of $A \mapsto F(A) + \lambda |A|$ is
$\{w \geq \lambda\}$ and the minimal minimizer is $\{w > \lambda\}$ (this is Prop. 5.4
with $\psi_j(w_j) = \frac{1}{2} w_j^2$, for which $\psi_j'(\alpha) = \alpha$).
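
As a sanity check of the link between the quadratic problem and the minimizers of $F(A) + \lambda |A|$, the following worked example uses an illustrative function that is not taken from the text: $V = \{1, 2\}$ and $F(A) = \min\{|A|, 1\}$, whose Lovász extension is $f(w) = \max\{w_1, w_2\}$. It applies Prop. 5.4 with $\psi_j(w_j) = \frac{1}{2} w_j^2$, for which $\psi_j'(\alpha) = \alpha$.

```latex
\begin{align*}
&\min_{w \in \mathbb{R}^2} \max\{w_1, w_2\} + \tfrac{1}{2}\|w\|_2^2
  \;\Rightarrow\; w = (-\tfrac{1}{2}, -\tfrac{1}{2}), \quad
  s = -w = (\tfrac{1}{2}, \tfrac{1}{2}) \in B(F)
  = \{ s : s_1 + s_2 = 1,\ s_1 \leq 1,\ s_2 \leq 1 \}, \\
&F(A) + \lambda |A| = 0,\ 1 + \lambda,\ 1 + 2\lambda
  \quad \text{for } |A| = 0, 1, 2, \\
&\lambda > -\tfrac{1}{2}: \ \text{unique minimizer }
  \varnothing = \{w \geq \lambda\} = \{w > \lambda\}, \\
&\lambda < -\tfrac{1}{2}: \ \text{unique minimizer }
  V = \{w \geq \lambda\} = \{w > \lambda\}, \\
&\lambda = -\tfrac{1}{2}: \ \text{minimizers }
  \varnothing = \{w > \lambda\} \ \text{and} \ V = \{w \geq \lambda\}.
\end{align*}
```

In this example $s = (\frac{1}{2}, \frac{1}{2})$ is indeed the minimum-norm point of the segment $B(F)$, and the level sets of $w$ at threshold $\lambda$ recover the minimal and maximal minimizers for every $\lambda$.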



One of the consequences of the last proposition is that some of the
solutions to the problem of minimizing a submodular function subject
to cardinality constraints may be obtained directly from the solution
of the quadratic separable optimization problems (see more details
in [M]). 



Primal candidates from dual candidates. From Prop. 5.5, given
the optimal solution $s$ of $\max_{s \in B(F)} -\frac{1}{2}\|s\|_2^2$, we obtain the optimal
solution $w = -s$ of $\min_{w \in \mathbb{R}^p} f(w) + \frac{1}{2}\|w\|_2^2$. However, when using
approximate algorithms such as the ones presented in Section 6, one may
actually get only an approximate dual solution $s$, and in this case, one
can improve on the natural candidate primal solution $w = -s$. Indeed,
assume that the components of $s$ are sorted in increasing order
$s_{j_1} \leq \cdots \leq s_{j_p}$, and denote $t \in B(F)$ the vector defined by
$t_{j_k} = F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\})$. Then we have $f(-s) = t^\top(-s)$,
and for any $w$ such that $w_{j_1} \geq \cdots \geq w_{j_p}$, we have $f(w) = w^\top t$. Thus,
by minimizing $w^\top t + \frac{1}{2}\|w\|_2^2$ subject to this constraint, we improve on
the choice $w = -s$. Note that this is exactly an isotonic regression
problem with total order, which can be solved simply and efficiently in
$O(p)$ by the "pool adjacent violators" algorithm (see, e.g., [H]). In
Section 9, we show that this leads to much improved approximate duality
gaps.
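
The isotonic regression step can be sketched in a few lines of Python. Minimizing $w^\top t + \frac{1}{2}\|w\|_2^2$ under the ordering constraint is the Euclidean projection of $-t$ onto non-increasing sequences, which the pool-adjacent-violators scheme below computes. The function names, the example set-function `G` (the cut of a two-node chain minus the modular function $z = (3, 0)$) and the approximate dual vector are illustrative assumptions, not taken from the text.

```python
def pav_nonincreasing(y):
    """Euclidean projection of y onto {w : w[0] >= w[1] >= ... >= w[-1]}."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in y:
        blocks.append([v, 1])
        # Pool while the last block's mean exceeds the previous block's mean.
        while (len(blocks) > 1 and
               blocks[-1][0] * blocks[-2][1] > blocks[-2][0] * blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    out = []
    for total, count in blocks:
        out.extend([total / count] * count)
    return out


def improved_primal(F, s):
    """Primal candidate from an approximate dual s, via isotonic regression."""
    p = len(s)
    order = sorted(range(p), key=lambda k: s[k])  # s sorted in increasing order
    t, prefix, prev = [0.0] * p, [], 0.0
    for k in order:  # greedy base associated with this ordering
        prefix.append(k)
        val = F(frozenset(prefix))
        t[k], prev = val - prev, val
    w_sorted = pav_nonincreasing([-t[k] for k in order])
    w = [0.0] * p
    for pos, k in enumerate(order):
        w[k] = w_sorted[pos]
    return w


def G(A):
    """Hypothetical example: two-node chain cut minus z(A), with z = (3, 0)."""
    return float((0 in A) != (1 in A)) - sum([3.0, 0.0][k] for k in A)


w = improved_primal(G, [-1.9, -1.2])  # approximate dual; the exact one is (-2, -1)
```

Here the natural candidate $-s = (1.9, 1.2)$ is only approximate, while the isotonic regression step returns $w = (2, 1)$, which is the exact primal solution for this small instance because the ordering of $s$ is already the correct one.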



Additional properties. Proximal problems with the square loss
exhibit further interesting properties. For example, when considering
problems of the form $\min_{w \in \mathbb{R}^p} \lambda f(w) + \frac{1}{2}\|w - z\|_2^2$, for varying $\lambda$, some
set-functions (such as the cut in the chain graph) lead to an agglomerative
path, i.e., as $\lambda$ increases, components of the unique optimal
solutions cluster together and never get separated [7].

Also, one may add an additional $\ell_1$-norm penalty to the regularized
quadratic separable problem defined above, and it is shown in [7] that,
for any submodular function, the solution of the optimization problem
may be obtained by soft-thresholding the result of the original proximal
problem (note that this is not true for all separable optimization
problems).

5.5 Separable problems on other polyhedra

We now show how to minimize a separable convex function on the
submodular polyhedron or the symmetric independence polyhedron (rather
than on the base polyhedron). We first show the following proposition
for the submodular polyhedron of any submodular function (not necessarily
non-decreasing), which relates the unrestricted proximal problem
with the proximal problem restricted to $\mathbb{R}_+^p$.

Proposition 5.6. (Separable optimization on the submodular
polyhedron) Assume that $F$ is submodular. Let $\psi_j$, $j = 1, \dots, p$, be $p$
convex functions such that $\psi_j^*$ is defined and finite on $\mathbb{R}$. Let $(v, t)$ be
a primal-dual optimal pair for the problem

$$\min_{v \in \mathbb{R}^p} f(v) + \sum_{k \in V} \psi_k(v_k) = \max_{t \in B(F)} - \sum_{k \in V} \psi_k^*(-t_k).$$

For $k \in V$, let $s_k$ be a maximizer of $-\psi_k^*(-s_k)$ on $(-\infty, t_k]$. Define
$w = v_+$. Then $(w, s)$ is a primal-dual optimal pair for the problem

$$\min_{w \in \mathbb{R}_+^p} f(w) + \sum_{k \in V} \psi_k(w_k) = \max_{s \in P(F)} - \sum_{k \in V} \psi_k^*(-s_k).$$



Proof. The pair $(w, s)$ is optimal if and only if (a) $w_k s_k + \psi_k(w_k) +
\psi_k^*(-s_k) = 0$, i.e., $(w_k, s_k)$ is a Fenchel-dual pair for $\psi_k$, and (b) $f(w) =
s^\top w$. The first statement (a) is true by construction (indeed, if $s_k = t_k$,
then this is a consequence of optimality for the first problem, and if
$s_k < t_k$, then $w_k = (\psi_k^*)'(-s_k) = 0$).

For the second statement (b), notice that $s$ is obtained from $t$ by
keeping the components of $t$ corresponding to strictly positive values
of $v$ (let $K$ denote that subset), and lowering the ones for $V \setminus K$. For
$\alpha > 0$, the level sets $\{w \geq \alpha\}$ are equal to $\{v \geq \alpha\} \subset K$. Thus, by
Prop. 4.2, all of these are tight for $t$ and hence for $s$ because these
sets are included in $K$, and $s_K = t_K$. This shows, by Prop. 4.2, that
$s \in P(F)$ is optimal for $\max_{s \in P(F)} w^\top s$. □

Note that Prop. 5.6 involves primal-dual pairs $(w, s)$ and $(v, t)$, but
that we can define $w$ from $v$ only, and define $s$ from $t$ only; thus,
primal-only views and dual-only views are possible. This also applies
to Prop. 5.7, which extends Prop. 5.6 to the symmetric independence
polyhedron (we denote by $a \circ b$ the pointwise product between two
vectors of the same dimension).

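Proposition 5.7. (Separable optimization on the symmetric
independence polyhedron) Assume that $F$ is submodular and
non-decreasing. Let $\psi_j$, $j = 1, \dots, p$, be $p$ convex functions such that $\psi_j^*$ is
defined and finite on $\mathbb{R}$. Let $\varepsilon_k \in \{-1, 1\}$ denote the sign of $(\psi_k^*)'(0)$
(if it is equal to zero, then the sign can be $-1$ or $1$). Let $(v, t)$ be a
primal-dual optimal pair for the problem

$$\min_{v \in \mathbb{R}^p} f(v) + \sum_{k \in V} \psi_k(\varepsilon_k v_k) = \max_{t \in B(F)} - \sum_{k \in V} \psi_k^*(-\varepsilon_k t_k).$$

Let $w = \varepsilon \circ (v_+)$ and let $s_k$ be $\varepsilon_k$ times a maximizer of $-\psi_k^*(-\varepsilon_k s_k)$ on
$(-\infty, t_k]$. Then $(w, s)$ is a primal-dual optimal pair for the problem

$$\min_{w \in \mathbb{R}^p} f(|w|) + \sum_{k \in V} \psi_k(w_k) = \max_{|s| \in P(F)} - \sum_{k \in V} \psi_k^*(-s_k).$$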



Proof. Because $f$ is non-decreasing with respect to each of its
components, we have:

$$\min_{w \in \mathbb{R}^p} f(|w|) + \sum_{k \in V} \psi_k(w_k) = \min_{v \in \mathbb{R}_+^p} f(v) + \sum_{k \in V} \psi_k(\varepsilon_k v_k).$$

We can thus apply Prop. 5.6 to $w_k \mapsto \psi_k(\varepsilon_k w_k)$, which has Fenchel
conjugate $s_k \mapsto \psi_k^*(\varepsilon_k s_k)$ (because $\varepsilon_k^2 = 1$), to get the desired result. □



Applications to sparsity-inducing norms. Prop. 5.7 is particularly
adapted to sparsity-inducing norms defined in Section 2.3, as it
describes how to solve the proximal problem for the norm $\Omega(w) = f(|w|)$.
For quadratic functions, i.e., $\psi_k(w_k) = \frac{1}{2}(w_k - z_k)^2$ and $\psi_k^*(s_k) =
\frac{1}{2} s_k^2 + s_k z_k$, $\varepsilon_k$ is the sign of $z_k$, and we thus have to solve

$$\min_{v \in \mathbb{R}^p} f(v) + \frac{1}{2} \sum_{k \in V} (v_k - |z_k|)^2,$$

which is the classical quadratic separable problem on the base polyhedron,
and select $w = \varepsilon \circ v_+$. Thus, proximal operators for the norm $\Omega$
may be obtained from the proximal operator for the Lovász extension.



6 Separable optimization problems - Algorithms



In the previous section, we have analyzed a series of optimization
problems which may be defined as the minimization of a separable function
on the base polyhedron. In this section, we consider algorithms to solve
these problems; most of them are based on the availability of an efficient
algorithm for maximizing linear functions (greedy algorithm from
Prop. 2.2). We focus on three types of algorithms. The algorithm we
present in Section 6.1 is a divide-and-conquer non-approximate method
that will recursively solve the separable optimization problems by defining
smaller problems. This algorithm requires being able to solve
submodular function minimization problems of the form $\min_{A \subset V} F(A) - t(A)$,
where $t \in \mathbb{R}^p$, and is thus applicable only when such algorithms are
available (such as in the case of cuts, flows or cardinality-based functions).
The next two sets of algorithms are iterative methods for convex
optimization on convex sets for which the support function can be
computed, and are often referred to as "Frank-Wolfe" algorithms. The
min-norm-point algorithm that we present in Section 6.2 is dedicated to
quadratic functions and converges after finitely many operations (but
with no complexity bounds), while the conditional gradient algorithms
that we consider in Section 6.3 do not exhibit finite convergence but
have known convergence rates.

Note that, from the use of the algorithms presented in this section,
we can derive a series of operations on the two polyhedra, namely line
searches and orthogonal projections (see also [103]).

6.1 Decomposition algorithm for proximal problems 

We now consider an algorithm for proximal problems, which is based
on a sequence of submodular function minimizations. It is based on a
divide-and-conquer strategy. We adapt the algorithm of [55] and [39,
Sec. 8.2]. Note that it can be slightly modified for problems with
non-decreasing submodular functions [55] (otherwise, Prop. 5.7 may be
used).

For simplicity, we consider strictly convex differentiable functions
$\psi_j^*$, $j = 1, \dots, p$ (so that the minimum in $s$ is unique), and the following
recursive algorithm:

(1) Find the unique minimizer $t \in \mathbb{R}^p$ of $\sum_{j \in V} \psi_j^*(-t_j)$ such that
$t(V) = F(V)$.

(2) Minimize the submodular function $F - t$, i.e., find the largest
$A \subset V$ that minimizes $F(A) - t(A)$.

(3) If $A = V$, then $t$ is optimal. Exit.

(4) Find a minimizer $s_A$ of $\sum_{j \in A} \psi_j^*(-s_j)$ over $s$ in the base
polyhedron associated to $F_A$, the restriction of $F$ to $A$.

(5) Find the unique minimizer $s_{V \setminus A}$ of $\sum_{j \in V \setminus A} \psi_j^*(-s_j)$ over $s$
in the base polyhedron associated to the contraction $F^A$ of
$F$ on $A$, defined as $F^A(B) = F(A \cup B) - F(A)$, for $B \subset V \setminus A$.

(6) Concatenate $s_A$ and $s_{V \setminus A}$. Exit.

The algorithm must stop after at most $p$ iterations. Indeed, if $A \neq V$
in step 3, then we must have $A \neq \emptyset$ (indeed, $A = \emptyset$ implies that
$t \in P(F)$, which in turn implies that $A = V$ because by construction
$t(V) = F(V)$, which leads to a contradiction). Thus we actually split $V$
into two non-trivial parts $A$ and $V \setminus A$. Step 1 is a separable optimization
problem with one linear constraint. When $\psi_j^*$ is a quadratic polynomial,
it may be obtained in closed form; more precisely, one may minimize
$\frac{1}{2}\|t - z\|_2^2$ subject to $t(V) = F(V)$ by taking $t = \frac{F(V)}{p} 1_V + z - \frac{1_V 1_V^\top}{p} z$.
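
The recursion can be illustrated with a short Python sketch for the quadratic case, i.e., the projection of $z$ onto $B(F)$. It is a toy version under explicit assumptions: the submodular minimization in step 2 is done by enumerating all subsets (only feasible for very small ground sets), ties are resolved with a numerical tolerance, and the names `project_onto_base` and `cut_chain` are hypothetical.

```python
from itertools import combinations


def all_subsets(items):
    """All subsets of a finite collection, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def project_onto_base(F, V, z, tol=1e-9):
    """Minimize 0.5 * ||s - z||^2 over the base polyhedron of F on ground set V.

    F maps frozensets to floats with F(empty set) = 0; z is indexed by the
    elements of V. Returns a dict {element: s_element}.
    """
    V = frozenset(V)
    # Step 1: closed-form minimizer under the single constraint t(V) = F(V).
    shift = (F(V) - sum(z[k] for k in V)) / len(V)
    t = {k: z[k] + shift for k in V}
    # Step 2: largest minimizer of F - t (brute force over all subsets).
    values = [(F(A) - sum(t[k] for k in A), A) for A in all_subsets(V)]
    best = min(v for v, _ in values)
    A = max((A for v, A in values if v <= best + tol), key=len)
    # Step 3: if A = V, then t is optimal.
    if A == V:
        return t
    # Steps 4-6: recurse on the restriction to A and on the contraction on A.
    s = project_onto_base(F, A, z, tol)
    F_A = F(A)
    s.update(project_onto_base(lambda B: F(A | B) - F_A, V - A, z, tol))
    return s


def cut_chain(A, p=3):
    """Cut function of the chain graph 0 - 1 - 2 with unit weights."""
    return float(sum((k in A) != (k + 1 in A) for k in range(p - 1)))


z = [3.0, 0.0, 1.0]
s = project_onto_base(cut_chain, range(3), z)
w = [z[k] - s[k] for k in range(3)]  # primal solution of the proximal problem
```

On this three-node chain the first split is $A = \{0\}$, and the sketch returns $s = (1, -1, 0)$, hence $w = z - s = (2, 1, 1)$, which is the proximal operator of the total variation on the chain applied to $z$.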

Proof of correctness. Let $s$ be the output of the algorithm. We first
show that $s \in B(F)$. We have for any $B \subset V$:

$$\begin{aligned}
s(B) &= s(B \cap A) + s(B \cap (V \setminus A)) \\
&\leq F(B \cap A) + F(A \cup B) - F(A) \quad \text{by definition of } s_A \text{ and } s_{V \setminus A} \\
&\leq F(B) \quad \text{by submodularity.}
\end{aligned}$$

Thus $s$ is indeed in the submodular polyhedron $P(F)$. Moreover, we
have $s(V) = s_A(A) + s_{V \setminus A}(V \setminus A) = F(A) + F(V) - F(A) = F(V)$, i.e.,
$s$ is in the base polyhedron $B(F)$.

Following [55], we now construct a second base $\tilde{s} \in B(F)$ as follows:
$\tilde{s}_A$ is the minimizer of $\sum_{j \in A} \psi_j^*(-s_j)$ over $s_A$ in the base polyhedron
associated to the submodular polyhedron $P(F_A) \cap \{s_A \leq t_A\}$.
From Prop. B.5, the associated submodular function is $H_A(B) =
\min_{C \subset B} F(C) + t(B \setminus C)$. We have $H_A(A) = \min_{C \subset A} F(C) - t(C) +
t(A) = F(A)$ because $A$ is the largest minimizer of $F - t$. Thus, the
base polyhedron associated with $H_A$ is simply $B(F_A) \cap \{s_A \leq t_A\}$.

Moreover, we define $\tilde{s}_{V \setminus A}$ as the minimizer of $\sum_{j \in V \setminus A} \psi_j^*(-s_j)$ over
the base polyhedron $B(J^A)$, where we define the submodular function
$J^A$ on $V \setminus A$ as follows: $J^A(B) = \min_{C \supset B} F(C \cup A) - F(A) - t(C) + t(B)$.
Then $J^A - t$ is non-decreasing and submodular (by Proposition B.6).
Moreover, $J^A(V \setminus A) = F(V) - F(A)$ and $J^A \leq F^A$. Finally,
$B(F^A) \cap \{s_{V \setminus A} \geq t_{V \setminus A}\} = B(J^A)$.

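We now show that $\tilde{s}$ is optimal for the problem. Since $s$ has an
objective value in Eq. (5.4) at least as high as that of $\tilde{s}$ (because $s$ is
optimized on a larger set), the base $s$ will then be optimal as well. In
order to show optimality, we need to show that if $w$ denotes the vector
of gradients (i.e., $w_k = (\psi_k^*)'(-\tilde{s}_k)$, so that $\tilde{s}_k = -\psi_k'(w_k)$ as in
Prop. 5.1), then $\tilde{s}$ is a maximizer of $s \mapsto w^\top s$ over $s \in B(F)$. Given
Prop. 4.2, we simply need to show that $\tilde{s}$ is tight for all level sets
$\{w \geq \alpha\}$. By construction, $\tilde{s}_k \leq t_k$ for all $k \in A$ and $\tilde{s}_q \geq t_q$ for all
$q \in V \setminus A$; since the derivatives $(\psi_j^*)'(-t_j)$ are equal for all $j \in V$ (by
optimality of $t$ in step 1) and each $(\psi_j^*)'$ is increasing, this gives
$w_k \geq w_q$, and level sets are either included in $A$ or obtained as the
union of $A$ and a subset of $V \setminus A$. Thus, by optimality of $\tilde{s}_A$ and
$\tilde{s}_{V \setminus A}$, these level sets are indeed tight, hence optimality.

Note finally that similar algorithms may be applied when we restrict
$s$ to be integers (see, e.g., [55, 62]).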

6.2 Iterative algorithms - Exact minimization

In this section, we focus on quadratic separable problems. Note that,
by modifying the submodular function through the addition of a modular
term¹, we can consider $\psi_k(w_k) = \frac{1}{2} w_k^2$. As shown in Prop. 5.5,
minimizing $f(w) + \frac{1}{2}\|w\|_2^2$ is equivalent to minimizing $\frac{1}{2}\|s\|_2^2$ such
that $s \in B(F)$.

Thus, we can minimize $f(w) + \frac{1}{2}\|w\|_2^2$ by computing the minimum
$\ell_2$-norm element of the polytope $B(F)$, or equivalently the orthogonal
projection of $0$ onto $B(F)$. Although $B(F)$ may have exponentially
many extreme points, the greedy algorithm of Prop. 2.2 allows one to
maximize a linear function over $B(F)$ at the cost of $p$ function evaluations.
The minimum-norm point algorithm of [135] is dedicated to such a
situation, as outlined by [50]. It turns out that the minimum-norm point
algorithm can be interpreted as a standard active set algorithm for
quadratic programming, which we now describe.

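Frank-Wolfe algorithm as an active set algorithm. We consider
$m$ points $x_1, \dots, x_m$ in $\mathbb{R}^p$ and the following optimization problem:

$$\min_{\eta \in \mathbb{R}^m} \; \frac{1}{2} \Big\| \sum_{i=1}^m \eta_i x_i \Big\|_2^2 \quad \text{such that } \eta \geq 0, \ \eta^\top 1 = 1.$$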

In our situation, the vectors $x_i$ will be the extreme points of $B(F)$,
i.e., outputs of the greedy algorithm, but they will always be used
implicitly through the maximization of linear functions over $B(F)$. We
will exactly apply the primal active set strategy outlined in Section 16.4
of [114], which is exactly the algorithm of [135]. The active set strategy
hinges on the fact that if the set of indices $j \in J$ for which $\eta_j > 0$ is
known, the solution $\eta_J$ may be obtained in closed form by computing
the affine projection on the set of points indexed by $J$ (which can be
implemented by solving a positive definite linear system, see step 2
in the algorithm below). Two cases occur: (a) If the affine projection
happens to have non-negative components, i.e., $\eta_J \geq 0$ (step 3), then
we obtain in fact the projection onto the convex hull of the points



¹ Indeed, we have $\frac{1}{2}\|w - z\|_2^2 + f(w) = \frac{1}{2}\|w\|_2^2 + (f(w) - w^\top z) + \frac{1}{2}\|z\|_2^2$, which corresponds (up
to the irrelevant constant term $\frac{1}{2}\|z\|_2^2$) to the proximal problem for the Lovász extension
of $A \mapsto F(A) - z(A)$.

indexed by $J$, and we simply need to check optimality conditions and
make sure that no other point needs to enter the hull (step 5), and
potentially add it to go back to step 2. (b) If the projection is not
in the convex hull, then we make a move towards this point until we
exit the convex hull (step 4) and start again at step 2. We describe in
Figure 6.1 an example of several iterations.

(1) Initialization: We start from a feasible point $\eta \in \mathbb{R}_+^m$ such
that $\eta^\top 1 = 1$, and denote $J$ the set of indices such that $\eta_j > 0$
(more precisely a subset of $J$ such that the set of vectors
indexed by the subset is linearly independent). Typically, we
select one of the original points, and $J$ is a singleton.

(2) Projection onto affine hull: Compute $\zeta_J$, the unique minimizer
of $\frac{1}{2}\|\sum_{j \in J} \zeta_j x_j\|_2^2$ such that $1^\top \zeta_J = 1$, i.e., the orthogonal
projection of $0$ onto the affine hull of the points $(x_j)_{j \in J}$.

(3) Test membership in convex hull: If $\zeta_J \geq 0$ (we in fact
have an element of the convex hull), set $\eta_J = \zeta_J$ and go to step 5.

(4) Line search: Let $\alpha \in [0, 1)$ be the largest $\alpha$ such that
$\eta_J + \alpha(\zeta_J - \eta_J) \geq 0$. Let $K$ be the set of $j$ such that
$\eta_j + \alpha(\zeta_j - \eta_j) = 0$. Replace $J$ by $J \setminus K$ and $\eta$ by
$\eta + \alpha(\zeta - \eta)$, and go to step 2.

(5) Check optimality: Let $y = \sum_{j \in J} \eta_j x_j$. Compute a minimizer
$i$ of $y^\top x_i$. If $y^\top x_i = y^\top y$, then $\eta$ is optimal. Otherwise,
replace $J$ by $J \cup \{i\}$, and go to step 2.
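
The five steps above can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the extreme points are produced by the greedy algorithm, the affine projection is computed by solving the KKT system of step 2 with a small Gaussian elimination, floating-point tolerances are chosen arbitrarily, and all function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_base(F, w):
    """Greedy algorithm: a maximizer of w^T s over the base polyhedron B(F)."""
    s, prefix, prev = [0.0] * len(w), [], 0.0
    for k in sorted(range(len(w)), key=lambda k: -w[k]):
        prefix.append(k)
        val = F(frozenset(prefix))
        s[k], prev = val - prev, val
    return s


def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (A[r][n] - dot(A[r][r + 1:n], x[r + 1:])) / A[r][r]
    return x


def affine_projection(X):
    """Weights (summing to one) of the min-norm point of the affine hull of X."""
    m = len(X)
    # KKT system [G 1; 1^T 0] [zeta; mu] = [0; 1], with G the Gram matrix.
    M = [[dot(X[i], X[j]) for j in range(m)] + [1.0] for i in range(m)]
    M.append([1.0] * m + [0.0])
    return solve(M, [0.0] * m + [1.0])[:m]


def min_norm_point(F, p, tol=1e-10, max_iter=1000):
    X, eta = [greedy_base(F, [0.0] * p)], [1.0]            # step 1
    for _ in range(max_iter):
        zeta = affine_projection(X)                        # step 2
        if min(zeta) > 0.0:                                # step 3
            eta = zeta
            y = [dot(eta, col) for col in zip(*X)]
            x_new = greedy_base(F, [-v for v in y])        # step 5
            if dot(y, y) - dot(y, x_new) <= tol:
                return y
            X.append(x_new)
            eta.append(0.0)
        else:                                              # step 4
            alpha = min([eta[j] / (eta[j] - zeta[j]) for j in range(len(X))
                         if zeta[j] <= 0.0 and eta[j] > zeta[j]] + [1.0])
            eta = [e + alpha * (zt - e) for e, zt in zip(eta, zeta)]
            keep = [j for j in range(len(X)) if eta[j] > 1e-12]
            X, eta = [X[j] for j in keep], [eta[j] for j in keep]
    raise RuntimeError("no convergence within max_iter iterations")


s_star = min_norm_point(lambda A: float(min(len(A), 1)), 3)
```

For $F(A) = \min\{|A|, 1\}$ on three elements, the base polyhedron is the simplex and the sketch returns its minimum-norm point $(1/3, 1/3, 1/3)$. Weights that are exactly zero are handled by the line-search branch, which then simply drops the corresponding points.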

The previous algorithm terminates in a finite number of iterations
because it strictly decreases the quadratic cost function at each iteration;
however, there are no known bounds regarding the number of iterations
(see more details in [114]). Note that in practice, the algorithm is
stopped after either (a) a certain duality gap has been achieved (given
the candidate $\eta$, the duality gap for $\eta$ is equal to
$\|\bar{x}\|_2^2 + \max_{i \in \{1, \dots, m\}} (-\bar{x}^\top x_i)$, where $\bar{x} = \sum_{i=1}^m \eta_i x_i$;
in the context of application to orthogonal projection on $B(F)$,
following Section 5.4, one may get an improved duality
gap by solving an isotonic regression problem); or (b) the affine
projection cannot be performed reliably because of bad condition number
(for more details regarding stopping criteria, see [135]).

Fig. 6.1: Illustration of Frank-Wolfe minimum norm point algorithm:
(a) initialization with J = {2} (step 1), (b) check optimality (step 5)
and take J = {2,5}, (c) compute affine projection (step 2), (d) check
optimality and take J = {1,2,5}, (e) perform line search (step 4) and
take J = {1,5}, (f) compute affine projection (step 2) and obtain
optimal solution.

6.3 Iterative algorithms - Approximate minimization 

In this section, we describe an algorithm strongly related to the
minimum-norm point algorithm presented in Section 6.2. This "conditional
gradient" algorithm is dedicated to the minimization of any convex
smooth function on the base polyhedron. Following the same argument
as for the proof of Prop. 5.1, this is equivalent to the minimization
of any strictly convex separable function regularized by the
Lovász extension. As opposed to the minimum-norm point algorithm,
it is not convergent in finitely many iterations; however, as shown in
Appendix A.2, it comes with approximation guarantees.

When performing optimization on the convex set B{F), it is usu- 
ally necessary to bound the convex set in some way. In our situation. 
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the base polyhedron is mcluded in the hyper-rectangle nA;ev[-^(^) ~ 
F {V\{k}) , F {{k})] (as a consequence of the greedy algorithm applied 
to Ij^i and — Ij^j). We denote by the length of the interval for 
variable k, i.e., = F{{k}) + F{V\{k}) - F{V). 

In this section, we also denote = 
^^^^max{|F({A;})|,|F(y\{fc})|}2. We then have that B{F) is 
included in the ^2-ball of center zero and radius D. 



Conditional gradient algorithms. If $g$ is a smooth convex function
defined on $\mathbb{R}^p$ with Lipschitz-continuous gradient (with constant $L$),
then the conditional gradient algorithm is an iterative algorithm that
will (a) start from a certain $s_0 \in B(F)$, and (b) iterate the following
procedure for $t \geq 1$: find a minimizer $\bar{s}_{t-1}$ over the (compact) polytope
$B(F)$ of the Taylor expansion of $g$ around $s_{t-1}$, i.e.,
$s \mapsto g(s_{t-1}) + g'(s_{t-1})^\top (s - s_{t-1})$, and perform a step towards
$\bar{s}_{t-1}$, i.e., compute

$$s_t = \omega_{t-1} \bar{s}_{t-1} + (1 - \omega_{t-1}) s_{t-1}.$$

There are several strategies for computing $\omega_{t-1}$. The first is to take
$\omega_{t-1} = 1/t$ [41, 72], while the second one is to perform a line search on
the quadratic upper-bound on $g$ obtained from the $L$-Lipschitz continuity
of the gradient of $g$ (see Appendix A.2 for details). They both exhibit the same
upper bound on the sub-optimality of the iterate $s_t$, together with
$g'(s_t)$ playing the role of a certificate of optimality. More precisely, we
have for the line search method (see Appendix A.2):

$$g(s_t) - \min_{s \in B(F)} g(s) = O\Big( \frac{L \sum_{k=1}^p \alpha_k^2}{t} \Big),$$

and the computable quantity $\max_{s \in B(F)} g'(s_t)^\top (s_t - s)$ provides a
certificate of optimality, that is, we always have
$g(s_t) - \min_{s \in B(F)} g(s) \leq \max_{s \in B(F)} g'(s_t)^\top (s_t - s)$, and the latter
quantity has (up to constants) the same convergence rate. Note that while
this certificate comes with an offline approximation guarantee, it can be
significantly improved, following Section 5.4, by solving an appropriate
isotonic regression problem (see simulations in Section 9).
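The iteration above can be sketched in a few lines for the quadratic case $g(s) = \frac{1}{2}\|s\|_2^2$. This is only an illustrative sketch under stated assumptions: the names `greedy_base`, `conditional_gradient`, `F_example` and the vector `Z` are hypothetical, the submodular function is accessed as a Python callable on frozensets with $F(\emptyset) = 0$, and the stopping rule uses the certificate $g'(s_t)^\top (s_t - \bar{s}_t)$.

```python
import math

def greedy_base(F, w):
    """Greedy algorithm: a base s of B(F) maximizing w^T s (assumes F(empty) = 0)."""
    p = len(w)
    order = sorted(range(p), key=lambda k: -w[k])  # decreasing components of w
    s, prefix, prev = [0.0] * p, [], 0.0
    for k in order:
        prefix.append(k)
        val = F(frozenset(prefix))
        s[k] = val - prev          # marginal gain of k along the ordering
        prev = val
    return s

def conditional_gradient(F, p, max_iter=1000, tol=1e-6):
    """Conditional gradient with line search for g(s) = 0.5 * ||s||^2 on B(F)."""
    s = greedy_base(F, [float(k) for k in range(p)])   # arbitrary extreme point s_0
    gap = float("inf")
    for _ in range(max_iter):
        # g'(s) = s, so minimizing the linearization means maximizing (-s)^T x.
        s_bar = greedy_base(F, [-v for v in s])
        d = [a - b for a, b in zip(s_bar, s)]
        gap = -sum(a * b for a, b in zip(s, d))        # certificate g'(s)^T (s - s_bar)
        if gap <= tol:
            break
        omega = min(1.0, gap / sum(a * a for a in d))  # exact line search on [0, 1]
        s = [a + omega * b for a, b in zip(s, d)]
    return s, gap

# Hypothetical example: F(A) = sqrt(|A|) + z(A), which is submodular.
Z = [-2.0, 0.5, 1.0, 1.5]

def F_example(A):
    return math.sqrt(len(A)) + sum(Z[k] for k in A)
```

Calling `conditional_gradient(F_example, 4)` returns an approximate minimum-norm base together with its duality gap; each iteration costs one call to the greedy algorithm, in line with the discussion above.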

In Figure 6.2, we consider the conditional gradient algorithm (with
line search) for the quadratic problem considered in Section 6.2. These
two algorithms are very similar as they both consider a sequence of
extreme points of $B(F)$ obtained from the greedy algorithm, but they
differ in the following way: the min-norm-point algorithm is finitely
convergent but with no convergence rate, while the conditional gradient
algorithm is not finitely convergent, but with a convergence rate.
Moreover, the cost per iteration for the min-norm-point algorithm is
much higher as it requires linear system inversions. In contexts where
the function $F$ is cheap to evaluate, this may become a computational
bottleneck.

Subgradient descent algorithm. Under the same assumption as
before, the Fenchel conjugate $g^*$ of $g$ is strongly convex with constant $1/L$
(see Appendix A.2 for the definition of strong convexity). Moreover, we
may restrict optimization to the ball of radius $D$ (if $g$ is
$D$-Lipschitz-continuous). We can thus apply the subgradient descent algorithm
described in Appendix A.2, with iteration
$w_t = w_{t-1} - \frac{L}{t}[(g^*)'(w_{t-1}) + s_{t-1}]$ (where $s_{t-1}$ is a subgradient
of $f$ at $w_{t-1}$, i.e., a maximizer of $s^\top w_{t-1}$ over $s \in B(F)$), and obtain
a convergence rate (see Appendix A.2).

The convergence rate is similar to the one for the conditional gradient,
but these are only upper bounds, and, as shown in the experiments,
the conditional gradient is faster. Note moreover that, when applied
to quadratic functions, the subgradient algorithm is equivalent to
applying a conditional gradient algorithm with no line search to the
dual problem (and a constant step size); indeed, for $g(s) = \frac{1}{2}\|s\|_2^2$
(and thus $L = 1$), we may rewrite the recursion as
$(-w_t) = (-w_{t-1}) + \frac{1}{t}[s_{t-1} - (-w_{t-1})]$, i.e., $-w_t$ is the
iterate of a conditional gradient algorithm.

Fig. 6.2: Illustration of Frank-Wolfe conditional gradient algorithm:
starting from the initialization (a), in steps (b),(d),(f),(h), an extreme
point on the polytope is found, and in steps (c),(e),(g),(i), a line search
is performed. Note the oscillations to converge to the optimal point
(especially compared to Figure 6.1).



7 



Submodular function minimization 



Several generic algorithms may be used for the minimization of a
submodular function. They are all based on a sequence of evaluations of
$F(A)$ for certain subsets $A \subseteq V$. For specific functions, such as the ones
defined from cuts or matroids, faster algorithms exist (see, e.g., [51, 62],
Section 3.2 and Section 3.8). For other special cases, such as functions
obtained as the sum of functions that depend on the intersection with
small subsets of $V$, faster algorithms also exist (see, e.g., [130, 83]).

Submodular function minimization algorithms may be divided in
two main categories: exact algorithms aim at obtaining a global
minimizer, while approximate algorithms only aim at obtaining an
approximate solution, that is, a set $A$ such that
$F(A) - \min_{B \subseteq V} F(B) \leq \varepsilon$, where $\varepsilon$ is as small as possible.
Note that if $\varepsilon$ is less than the minimal absolute difference $\delta$ between
non-equal values of $F$, then this leads to an exact solution, but that in
many cases, this difference $\delta$ may be arbitrarily small.

An important practical aspect of submodular function minimization
is that most algorithms come with online approximation guarantees;
indeed, because of a duality relationship detailed in Section 7.1, in a
very similar way to convex optimization, a base $s \in B(F)$ may serve as a
certificate for optimality. Note that all algorithms except the
minimum-norm-point algorithm from Section 7.2 come with offline approximation
guarantees.

In Section 7.3, we review combinatorial algorithms for submodular
function minimization that come with complexity bounds. Those are
however not used in practice, in particular due to their high theoretical
complexity (i.e., $O(p^6)$), except for the particular class of posimodular
functions, where algorithms scale as $O(p^3)$ (see Section 7.4).

In Section 7.5, we describe optimization algorithms based on
separable optimization problems regularized by the Lovasz extension. Using
directly the equivalence presented in Prop. 2.4, we can minimize the
Lovasz extension $f$ on the hypercube $[0, 1]^p$ using subgradient descent
with approximate optimality for submodular function minimization of
$O(1/\sqrt{t})$ after $t$ iterations. Using quadratic separable problems, we can
use the algorithms of Section 6.3 to obtain new submodular function
minimization algorithms with convergence of the convex optimization
problem at rate $O(1/t)$, which translates through the analysis of
Section 5 to the same convergence rate of $O(1/\sqrt{t})$ for submodular function
minimization, although with improved behavior and better empirical
performance (see Section 7.5 and Section 9).

We also consider in Section 7.2 a formulation based on a quadratic
separable problem on the base polyhedron, but using the
minimum-norm-point algorithm described in Section 6.2; this is one of the fastest
in practice, but it comes with no complexity bounds.

Note that maximizing submodular functions is a hard combinatorial
problem in general. However, when maximizing a non-decreasing
submodular function under a cardinality constraint, the simple greedy
method yields a $(1 - 1/e)$-approximation (see more details in Section 8).

7.1 Minimizers of submodular functions 

In this section, we review some relevant results for submodular function
minimization (for which algorithms are presented in the next sections).

Proposition 7.1. (Lattice of minimizers of submodular functions)
Let $F$ be a submodular function such that $F(\emptyset) = 0$. The set
of minimizers of $F$ is a lattice, i.e., if $A$ and $B$ are minimizers, so are
$A \cup B$ and $A \cap B$.



Proof. Given minimizers $A$ and $B$ of $F$, then, by submodularity, we
have $2 \min_{C \subseteq V} F(C) \leq F(A \cup B) + F(A \cap B) \leq F(A) + F(B) = 2 \min_{C \subseteq V} F(C)$,
hence equality in the first inequality, which leads to
the desired result. □
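As a sanity check of Prop. 7.1, the lattice property can be verified by brute force on a small example. The sketch below is illustrative only: the path graph `EDGES`, the modular term `M` and all function names are hypothetical choices, and enumeration of all subsets is only feasible for tiny base sets.

```python
from itertools import combinations

def all_subsets(V):
    V = sorted(V)
    return [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def minimizers(F, V):
    """All minimizers of F over subsets of V, by exhaustive enumeration."""
    values = {A: F(A) for A in all_subsets(V)}
    m = min(values.values())
    return [A for A, v in values.items() if v == m]

def is_lattice(sets):
    """True if the family is closed under union and intersection."""
    family = set(sets)
    return all((A | B) in family and (A & B) in family
               for A in family for B in family)

# Hypothetical example: unit-weight cut on the path 0-1-2-3 plus a modular term.
EDGES = [(0, 1), (1, 2), (2, 3)]
M = [-1, 0, 0, 1]

def F_example(A):
    cut = sum(1 for i, j in EDGES if (i in A) != (j in A))
    return cut + sum(M[k] for k in A)
```

For this example the minimizers form the chain $\emptyset \subset \{0\} \subset \{0,1\} \subset \{0,1,2\} \subset V$, which is indeed closed under union and intersection.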

The following proposition shows that some form of local optimality 
implies global optimality. 



Proposition 7.2. (Property of minimizers of submodular functions)
Let $F$ be a submodular function such that $F(\emptyset) = 0$. The
set $A \subseteq V$ is a minimizer of $F$ on $2^V$ if and only if $A$ is a
minimizer of the function from $2^A$ to $\mathbb{R}$ defined as $B \subseteq A \mapsto F(B)$,
and if $\emptyset$ is a minimizer of the function from $2^{V \setminus A}$ to $\mathbb{R}$ defined as
$B \subseteq V \setminus A \mapsto F(B \cup A) - F(A)$.



Proof. The set of two conditions is clearly necessary. To show that it is
sufficient, we let $B \subseteq V$. We have:
$F(A) + F(B) \geq F(A \cup B) + F(A \cap B) \geq F(A) + F(A)$,
by using the submodularity of $F$ and then the set of two
conditions. This implies that $F(A) \leq F(B)$, for all $B \subseteq V$, hence the
desired result. □

The following proposition provides a useful step towards submodular
function minimization. In fact, it is the starting point of most
polynomial-time algorithms presented in Section 7.3. Note that
submodular function minimization may also be obtained from minimizing
$\|s\|_2$ over $s$ in the base polyhedron (see Section 5).



Proposition 7.3. (Dual of minimization of submodular functions)
Let $F$ be a submodular function such that $F(\emptyset) = 0$. We have:

$$\min_{A \subseteq V} F(A) = \max_{s \in B(F)} s_-(V) = \frac{1}{2} F(V) - \frac{1}{2} \min_{s \in B(F)} \|s\|_1, \qquad (7.1)$$

where $(s_-)_k = \min\{s_k, 0\}$ for $k \in V$. Moreover, given $A \subseteq V$ and
$s \in B(F)$, we always have $F(A) \geq s_-(V)$ with equality if and only if
$\{s < 0\} \subseteq A \subseteq \{s \leq 0\}$ and $A$ is tight for $s$, i.e., $s(A) = F(A)$.
We also have

$$\min_{A \subseteq V} F(A) = \max_{s \in P(F),\ s \leq 0} s(V). \qquad (7.2)$$

Moreover, given $A \subseteq V$ and $s \in P(F)$ such that $s \leq 0$, we always have
$F(A) \geq s(V)$ with equality if and only if $\{s < 0\} \subseteq A$ and $A$ is tight
for $s$, i.e., $s(A) = F(A)$.

Proof. We have, by convex duality, Prop. 2.4 and the identity
$f(w) = \max_{s \in B(F)} w^\top s$:

$$\min_{A \subseteq V} F(A) = \min_{w \in [0,1]^p} f(w) = \min_{w \in [0,1]^p} \max_{s \in B(F)} w^\top s = \max_{s \in B(F)} \min_{w \in [0,1]^p} w^\top s = \max_{s \in B(F)} s_-(V).$$

Strong duality indeed holds because of Slater's condition ($[0,1]^p$ has
non-empty interior). Since $s(V) = F(V)$ for all $s \in B(F)$, we have
$s_-(V) = \frac{1}{2} F(V) - \frac{1}{2} \|s\|_1$, hence the second equality.
Moreover, we have, for all $A \subseteq V$ and $s \in B(F)$:

$$F(A) \geq s(A) = s(A \cap \{s < 0\}) + s(A \cap \{s > 0\}) \geq s(A \cap \{s < 0\}) \geq s_-(V),$$

with equality if there is equality in the three inequalities. The first one
leads to $s(A) = F(A)$. The second one leads to $A \cap \{s > 0\} = \emptyset$, and
the last one leads to $\{s < 0\} \subseteq A$. Moreover,

$$\max_{s \in P(F),\ s \leq 0} s(V) = \max_{s \in P(F)} \min_{w \geq 0} s^\top 1_V - w^\top s = \min_{w \geq 0} \max_{s \in P(F)} s^\top 1_V - w^\top s$$

$$= \min_{1 \geq w \geq 0} f(1_V - w) \quad \text{because } f(v) = \max_{s \in P(F)} v^\top s \text{ for } v \geq 0,$$

$$= \min_{A \subseteq V} F(A) \quad \text{because of Prop. 2.4.}$$

Finally, given $s \in P(F)$ such that $s \leq 0$ and $A \subseteq V$, we have:

$$F(A) \geq s(A) = s(A \cap \{s < 0\}) \geq s(V),$$

with equality if and only if $A$ is tight and $\{s < 0\} \subseteq A$. □
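The weak-duality inequality $F(A) \geq s_-(V)$ of Prop. 7.3 can be checked numerically on a small example. In the sketch below (hypothetical names `Z`, `F`, `greedy_base`, `s_minus`), the bases are the extreme points of $B(F)$ obtained from the greedy algorithm over all orderings; for this particular function one of them happens to attain the maximum of $s_-(V)$, which need not be the case in general since the maximizer may lie in the interior of a face.

```python
import math
from itertools import combinations, permutations

Z = [-2.0, 0.5, 1.0, 1.5]          # hypothetical modular part
V = range(len(Z))

def F(A):
    """F(A) = sqrt(|A|) + z(A), a submodular function with F(empty) = 0."""
    return math.sqrt(len(A)) + sum(Z[k] for k in A)

def greedy_base(order):
    """Extreme point of B(F) associated with the given ordering of V."""
    s, prefix, prev = [0.0] * len(Z), [], 0.0
    for k in order:
        prefix.append(k)
        val = F(frozenset(prefix))
        s[k], prev = val - prev, val
    return s

def s_minus(s):
    """s_-(V): sum of the negative parts of the components of s."""
    return sum(min(v, 0.0) for v in s)

subsets = [frozenset(c) for r in range(len(Z) + 1) for c in combinations(V, r)]
min_F = min(F(A) for A in subsets)
bases = [greedy_base(order) for order in permutations(V)]
best_dual = max(s_minus(s) for s in bases)
```

Here `min_F` and `best_dual` coincide, and the optimal pair ($A = \{0\}$ and the base of the ordering $(0,1,2,3)$) satisfies the tightness conditions of the proposition.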
7.2 Minimum-norm point algorithm

From Eq. (5.5) and the results of Section 5, we obtain that if we know how to
minimize $f(w) + \frac{1}{2}\|w\|_2^2$, or equivalently, minimize $\frac{1}{2}\|s\|_2^2$ such that
$s \in B(F)$, then we get all minimizers of $F$ from the negative components of $s$.

We can then apply the minimum-norm point algorithm detailed
in Section 6.2 to the vertices of $B(F)$, and notice that step 5 does not
require to list all extreme points, but simply to maximize (or minimize)
a linear function, which we can do owing to the greedy algorithm. The
complexity of each step of the algorithm is essentially $O(p)$ function
evaluations and operations of order $O(p^3)$. However, there are no known
upper bounds on the number of iterations. Finally, we obtain $s \in B(F)$
as a convex combination of extreme points.

Note that once we know which values of the optimal vector $s$ (or
$w$) should be equal, greater or smaller, then we obtain in closed form
all values. Indeed, let $v_1 > v_2 > \cdots > v_m$ be the $m$ different values taken
by $w$, and $A_j$ the corresponding sets such that $w_k = v_j$ for $k \in A_j$.
Since we can express

$$f(w) + \frac{1}{2}\|w\|_2^2 = \sum_{j=1}^m \Big\{ v_j [F(A_1 \cup \cdots \cup A_j) - F(A_1 \cup \cdots \cup A_{j-1})] + \frac{|A_j|}{2} v_j^2 \Big\},$$

we then have:

$$v_j = \frac{-F(A_1 \cup \cdots \cup A_j) + F(A_1 \cup \cdots \cup A_{j-1})}{|A_j|},$$

which allows to compute the values $v_j$ knowing only the sets $A_j$ (i.e.,
the ordered partition of constant sets of the solution). This shows in
particular that minimizing $f(w) + \frac{1}{2}\|w\|_2^2$ may be seen as a certain
search problem over ordered partitions.

7.3 Combinatorial algorithms 

Most algorithms are based on Prop. 7.3, i.e., on the identity
$\min_{A \subseteq V} F(A) = \max_{s \in B(F)} s_-(V)$. Combinatorial algorithms will
usually output the subset $A$ and a base $s \in B(F)$ such that $A$ is tight for
$s$ and $\{s < 0\} \subseteq A \subseteq \{s \leq 0\}$, as a certificate of optimality.

Most algorithms will also output the largest minimizer $A$ of $F$, or
sometimes describe the entire lattice of minimizers. The best algorithms
have polynomial complexity [123, 70, 116], but still have high
complexity (typically $O(p^6)$ or more). Most algorithms update a sequence of
convex combinations of vertices of $B(F)$ obtained from the greedy
algorithm using a specific order. Recent algorithms [73] consider efficient
reformulations in terms of generalized graph cuts.

Note here the difference between the combinatorial algorithms, which
maximize $s_-(V)$, and the ones based on the minimum-norm point
algorithm, which maximize $-\frac{1}{2}\|s\|_2^2$ over the base polyhedron $B(F)$. In
both cases, the submodular function minimizer $A$ is obtained by taking
the negative values of $s$. In fact, the unique minimizer of $\frac{1}{2}\|s\|_2^2$ is also
a maximizer of $s_-(V)$, but not vice-versa.

7.4 Minimizing symmetric posimodular functions 

A submodular function $F$ is said to be symmetric if for all $B \subseteq V$,
$F(V \setminus B) = F(B)$. By applying submodularity, we get that
$2F(B) = F(V \setminus B) + F(B) \geq F(V) + F(\emptyset) = 2F(\emptyset) = 0$, which implies that
$F$ is non-negative. Hence its global minimum is attained at $V$ and $\emptyset$.
Undirected cuts (see Section 3.2) are the main classical examples of
such functions.

Such functions can be minimized in time $O(p^3)$ over all non-trivial
(i.e., different from $\emptyset$ and $V$) subsets of $V$ through a simple algorithm
of Queyranne [117]. Moreover, the algorithm is valid for the regular
minimization of posimodular functions [102], i.e., of functions that
satisfy

$$\forall A, B \subseteq V, \quad F(A) + F(B) \geq F(A \setminus B) + F(B \setminus A).$$

These include symmetric submodular functions as well as
non-decreasing modular functions, and hence the sum of any of those (in
particular, cuts with sinks and sources, as presented in Section 3.2).
Note however that this does not include general modular functions (i.e.,
with potentially negative values); worse, minimization of functions of
the form $\lambda F(A) - z(A)$ is provably as hard as general submodular
function minimization [117]. Therefore this $O(p^3)$ algorithm is quite specific
and may not be used for solving proximal problems with symmetric
functions.
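For illustration, here is a compact sketch of the pendent-pair scheme commonly attributed to Queyranne for symmetric submodular functions; it is written from the standard textbook description rather than from [117], and the names `queyranne`, `WEIGHTS` and `cut` are hypothetical. At each round an ordering is built by repeatedly adding the group $u$ minimizing $F(W \cup u) - F(u)$; the last group of the ordering is a candidate set, and the last two groups are merged.

```python
from itertools import combinations

def queyranne(F, n):
    """Minimize a symmetric submodular F over non-trivial subsets of range(n), n >= 2."""
    groups = [frozenset([i]) for i in range(n)]   # current merged elements
    best_val, best_set = float("inf"), None
    while len(groups) > 1:
        order, W = [0], groups[0]                 # arbitrary starting group
        remaining = set(range(1, len(groups)))
        while remaining:
            u = min(remaining, key=lambda j: F(W | groups[j]) - F(groups[j]))
            order.append(u)
            remaining.remove(u)
            W = W | groups[u]
        t, u = order[-2], order[-1]               # pendent pair
        if F(groups[u]) < best_val:
            best_val, best_set = F(groups[u]), groups[u]
        merged = groups[t] | groups[u]
        groups = [g for j, g in enumerate(groups) if j not in (t, u)] + [merged]
    return best_set, best_val

# Hypothetical weighted undirected graph; its cut function is symmetric submodular.
WEIGHTS = {(0, 1): 3, (0, 2): 1, (1, 2): 3, (2, 3): 1, (3, 4): 4, (1, 4): 1}

def cut(A):
    return sum(w for (i, j), w in WEIGHTS.items() if (i in A) != (j in A))
```

On this graph the procedure uses $O(p^3)$ function evaluations and returns the same value as exhaustive enumeration over non-trivial subsets.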
7.5 Approximate minimization through convex optimization

In this section, we consider two approaches to submodular function
minimization based on iterative algorithms for convex optimization: a
direct approach, which is based on minimizing the Lovasz extension
directly on $[0, 1]^p$ (and thus using Prop. 2.4), and an indirect approach,
which is based on quadratic separable optimization problems (and thus
using Prop. 5.3). All these algorithms will access the submodular
function through the greedy algorithm, once per iteration, with minor
operations in between.

Restriction of the problem. Given a submodular function $F$, if
$F(\{k\}) < 0$, then $k$ must be in any minimizer of $F$, since, because of
submodularity, if it is not, then adding it would reduce the value of $F$.
Similarly, if $F(V) - F(V \setminus \{k\}) > 0$, then $k$ must be in the complement of
any minimizer of $F$. Thus, if we denote by $A_{\min}$ the set of $k \in V$ such that
$F(\{k\}) < 0$ and by $A_{\max}$ the complement of the set of $k \in V$ such that
$F(V) - F(V \setminus \{k\}) > 0$, then we may restrict the minimization of $F$ to
subsets $A$ such that $A_{\min} \subseteq A \subseteq A_{\max}$. This is equivalent to minimizing
the submodular function $A \mapsto F(A \cup A_{\min}) - F(A_{\min})$ on $A_{\max} \setminus A_{\min}$.
From now on (mostly for the convergence rate described below),
we assume that we have done this restriction and that we are now
minimizing a function $F$ so that for all $k \in V$, $F(\{k\}) \geq 0$ and
$F(V) - F(V \setminus \{k\}) \leq 0$. We denote $\alpha_k = F(\{k\}) + F(V \setminus \{k\}) - F(V)$,
which is non-negative by submodularity. Note that in practice, this
restriction can be seamlessly done by starting regular iterative methods
from specific starting points.
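The restriction step can be written down directly from the two tests above. The following is a minimal sketch with hypothetical names (`restriction`, `restricted_function`, `F_example`, `Z`); it only performs one pass of the tests and represents sets as frozensets.

```python
import math
from itertools import combinations

def restriction(F, V):
    """Return (A_min, A_max) such that every minimizer A satisfies A_min <= A <= A_max."""
    V = frozenset(V)
    a_min = frozenset(k for k in V if F(frozenset([k])) < 0)
    a_max = frozenset(k for k in V if F(V) - F(V - {k}) <= 0)
    return a_min, a_max

def restricted_function(F, a_min):
    """The function A -> F(A | A_min) - F(A_min), to be minimized on A_max - A_min."""
    return lambda A: F(frozenset(A) | a_min) - F(a_min)

# Hypothetical example: F(A) = sqrt(|A|) + z(A).
Z = [-2.0, -0.3, 1.0, 1.5]

def F_example(A):
    return math.sqrt(len(A)) + sum(Z[k] for k in A)

def all_subsets(V):
    V = sorted(V)
    return [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
```

For this example, `restriction(F_example, range(4))` gives $A_{\min} = \{0\}$ and $A_{\max} = \{0, 1\}$, so only element 1 remains undecided.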

Direct approach. From Prop. 2.4, we can use any convex optimization
algorithm to minimize $f(w)$ on $w \in [0,1]^p$. Following [60],
we consider subgradient descent with step-size $\gamma_t$ proportional to
$1/\sqrt{t}$ (with a constant that depends on $p$ and $D$, where
$D^2 = \sum_{k \in V} \alpha_k^2$), i.e., starting from any $w_0 \in [0, 1]^p$, we iterate (a)
the computation of a maximizer $s_{t-1}$ of $w_{t-1}^\top s$ over $s \in B(F)$, and (b)
the update $w_t = \Pi_{[0,1]^p}[w_{t-1} - \gamma_t s_{t-1}]$, where $\Pi_{[0,1]^p}$ is the
orthogonal projection onto the set $[0, 1]^p$ (which may be done by thresholding the
components independently).

The following proposition shows that in order to obtain a certified
$\varepsilon$-approximate set $B$, we need at most $O(pD^2/\varepsilon^2)$ iterations of subgradient
descent (whose complexity is that of the greedy algorithm to find a
base $s \in B(F)$).

Proposition 7.4. (Submodular function minimization by subgradient
descent) After $t$ steps of projected subgradient descent,
among the $p$ sup-level sets of $w_t$, there is a set $B$ such that
$F(B) - \min_{A \subseteq V} F(A) = O(D p^{1/2} / \sqrt{t})$. Moreover, we have a certificate of
optimality $\bar{s}_t = \frac{1}{t+1} \sum_{u=0}^t s_u$, so that $F(B) - (\bar{s}_t)_-(V)$ satisfies
the same bound, with $D^2 = \sum_{k=1}^p \alpha_k^2$.

Proof. Given an approximate solution $w$ so that $0 \leq f(w) - f^* \leq \varepsilon$,
with $f^* = \min_{A \subseteq V} F(A) = \min_{w \in [0,1]^p} f(w)$, we can sort the elements
of $w$ in decreasing order, i.e., $1 \geq w_{j_1} \geq \cdots \geq w_{j_p} \geq 0$. We then have,
with $B_k = \{j_1, \dots, j_k\}$,

$$f(w) - f^* = \sum_{k=1}^{p-1} (F(B_k) - f^*)(w_{j_k} - w_{j_{k+1}}) + (F(V) - f^*)(w_{j_p} - 0) + (F(\emptyset) - f^*)(1 - w_{j_1}).$$

Thus, as the sum of positive numbers, there must be at least one $B_k$
such that $F(B_k) - f^* \leq \varepsilon$. Therefore, given $w$ such that $0 \leq f(w) - f^* \leq \varepsilon$,
there is at least one of the sup-level sets of $w$ whose value of $F$
is $\varepsilon$-approximate.

The subgradients of $f$, i.e., elements $s$ of $B(F)$, are such that
$F(V) - F(V \setminus \{k\}) \leq s_k \leq F(\{k\})$. This implies that $f$ is Lipschitz-continuous
with constant $D$, with $D^2 = \sum_{k=1}^p \alpha_k^2$. Since $[0,1]^p$ is included in an
$\ell_2$-ball of radius $\sqrt{p}/2$, results from Appendix A.2 imply that we may
take $\varepsilon = O(D p^{1/2} / \sqrt{t})$. Moreover, as shown in Appendix A.2, the average of
all subgradients provides a certificate of duality with the same known
convergence rate (i.e., if we use it as a certificate, it may lead to much
better certificates than the bound actually suggests).

Finally, if we replace the subgradient iteration by
$w_t = \Pi_{[0,1]^p}[w_{t-1} - \gamma_t \mathrm{Diag}(\alpha)^{-1} s_{t-1}]$, then this corresponds to a
subgradient descent algorithm on the function $w \mapsto f(\mathrm{Diag}(\alpha)^{-1/2} w)$ on the
set $\prod_{k \in V} [0, \alpha_k^{1/2}]$, for which the diameter of the domain and the
Lipschitz constant are equal to $(\sum_{k \in V} \alpha_k)^{1/2}$. We thus obtain the improved
convergence rate of $O(\sum_{k \in V} \alpha_k / \sqrt{t})$. □
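A minimal sketch of the direct approach may help fix ideas. It assumes a step size $\gamma_t = \texttt{step0}/\sqrt{t}$ with an arbitrary constant `step0` (not the constant used in the analysis above), keeps the best set found among the prefixes of the greedy ordering at each iterate (these contain the sup-level sets of $w$), and uses hypothetical names throughout.

```python
import math

def greedy(F, w):
    """Greedy algorithm: subgradient s of the Lovasz extension at w, and prefix values."""
    p = len(w)
    order = sorted(range(p), key=lambda k: -w[k])   # decreasing components of w
    s, prefix, values, prev = [0.0] * p, [], [], 0.0
    for k in order:
        prefix.append(k)
        val = F(frozenset(prefix))
        s[k], prev = val - prev, val
        values.append((val, frozenset(prefix)))
    return s, values

def subgradient_sfm(F, p, n_iter=200, step0=0.5):
    """Projected subgradient descent on [0, 1]^p, tracking the best set seen so far."""
    w = [0.5] * p
    best_val, best_set = 0.0, frozenset()           # F(empty set) = 0
    for t in range(1, n_iter + 1):
        s, values = greedy(F, w)
        val, A = min(values, key=lambda pair: pair[0])
        if val < best_val:
            best_val, best_set = val, A
        gamma = step0 / math.sqrt(t)                # assumed step-size schedule
        w = [min(1.0, max(0.0, wk - gamma * sk)) for wk, sk in zip(w, s)]
    return best_set, best_val

# Hypothetical example: F(A) = sqrt(|A|) + z(A).
Z = [1.5, 1.0, 0.5, -2.0]

def F_example(A):
    return math.sqrt(len(A)) + sum(Z[k] for k in A)
```

Each iteration costs one run of the greedy algorithm, and the projection onto $[0,1]^p$ is the componentwise clipping in the last line of the loop.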

The previous proposition relies on the simplest algorithm for
convex optimization, subgradient descent, which is applicable in most
situations; its use is appropriate here because the Lovasz extension
is not differentiable, and the dual problem is also not differentiable. We
now consider separable quadratic optimization problems whose duals
are the maximization of a concave quadratic function on $B(F)$, which
is smooth. We can thus use the conditional gradient algorithm, with a
better convergence rate; however, as we show below, when we threshold
the solution to obtain a set $A$, we get the same scaling as before
(i.e., $O(1/\sqrt{t})$), with an improved empirical behavior. See below and
experimental comparisons in Section 9.

Conditional gradient. We now consider the set-up of Section 5 with
$\psi_k(w_k) = \frac{1}{2L_k} w_k^2$, and thus $\psi_k^*(s_k) = \frac{L_k}{2} s_k^2$, for certain constants
$L_k > 0$. That is, we consider the conditional gradient algorithm studied in
Section 6.3 and Appendix A.2, with $g(s) = \frac{1}{2} \sum_{k \in V} L_k s_k^2$: (a) starting
from any base $s_0 \in B(F)$, iterate (b) the greedy algorithm to obtain
a minimizer $\bar{s}_{t-1}$ of $(s_{t-1} \circ L)^\top s$ with respect to $s \in B(F)$, and (c)
perform a line search to minimize with respect to $\omega \in [0, 1]$,
$[s_{t-1} + \omega(\bar{s}_{t-1} - s_{t-1})]^\top \mathrm{Diag}(L) [s_{t-1} + \omega(\bar{s}_{t-1} - s_{t-1})]$.

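Let $\alpha_k = F(\{k\}) + F(V \setminus \{k\}) - F(V)$, $k = 1, \dots, p$, be the widths of
the hyper-rectangle enclosing $B(F)$. The following proposition shows
how to obtain an approximate minimizer of $F$.

Proposition 7.5. (Submodular function minimization by conditional
gradient descent) After $t$ steps of the conditional gradient
method described above, among the $p$ sub-level sets of $L \circ s_t$, there is
a set $B$ such that
$F(B) - \min_{A \subseteq V} F(A) = O\Big( \frac{1}{\sqrt{t}} \sqrt{ \sum_{k=1}^p \alpha_k^2 L_k \sum_{k=1}^p L_k^{-1} } \Big)$.
Moreover, $s_t$ acts as a certificate of optimality, so that $F(B) - (s_t)_-(V)$
satisfies the same bound.

Proof. The convergence rate analysis of the conditional gradient
method leads to an $\varepsilon$-approximate solution with
$\varepsilon = O\big( \sum_{k=1}^p \alpha_k^2 L_k / t \big)$. From the duality-gap decomposition of
Section 5, if we assume that
$(F + \psi'(\alpha))(\{w \geq \alpha\}) - (s + \psi'(\alpha))_-(V) > \varepsilon / 2\eta$
for all $\alpha \in [-\eta, \eta]$, then we obtain (with $\psi'(\alpha)_k = \alpha / L_k$):

$$\varepsilon \geq \int_{-\eta}^{\eta} \big[ (F + \psi'(\alpha))(\{w \geq \alpha\}) - (s + \psi'(\alpha))_-(V) \big] d\alpha > \varepsilon,$$

which is a contradiction. Thus, there exists $\alpha \in [-\eta, \eta]$ such that
$0 \leq (F + \psi'(\alpha))(\{w \geq \alpha\}) - (s + \psi'(\alpha))_-(V) \leq \varepsilon / 2\eta$. Let $A^*$ be a minimizer
of $F$. We have:

$$F(\{w \geq \alpha\}) + \psi'(\alpha)(\{w \geq \alpha\}) \leq F(A^*) + \psi'(\alpha)(A^*) + \varepsilon / 2\eta,$$

leading to $F(\{w \geq \alpha\}) \leq F(A^*) + \eta \sum_{k=1}^p \frac{1}{L_k} + \frac{\varepsilon}{2\eta}$. By choosing
$\eta = \sqrt{\varepsilon / (2 \sum_{k=1}^p L_k^{-1})}$, we obtain
$F(\{w \geq \alpha\}) \leq F(A^*) + \sqrt{2 \varepsilon \sum_{k=1}^p L_k^{-1}}$. This leads to an
approximation of

$$O\Big( \frac{1}{\sqrt{t}} \sqrt{ \sum_{k=1}^p \alpha_k^2 L_k \sum_{k=1}^p L_k^{-1} } \Big).$$

□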



In the previous proposition, two natural choices for $L_k$ emerge:
the traditional choice $L_k = 1$, which leads to a convergence rate
of $O\big( \sqrt{p \sum_{k=1}^p \alpha_k^2 / t} \big)$, and $L_k = 1/\alpha_k$, leading to a convergence rate of
$O\big( \sum_{k=1}^p \alpha_k / \sqrt{t} \big)$. Here the convergence rate is the same as for
subgradient descent. See Section 9 for an empirical comparison, showing a
better behavior for the conditional gradient method. As for subgradient
descent, this algorithm provides certificates of optimality. Moreover,
when offline (or online) certificates of optimality ensure that we have an
approximate solution, because the problem is strongly convex, we obtain
also a bound $\sqrt{2\varepsilon}$ on $\|s_t - s^*\|_2$, where $s^*$ is the optimal solution. In
the case where $L_k = 1$ for all $k \in V$, this in turn allows to ensure that
all indices $k$ such that $(s_t)_k > \sqrt{2\varepsilon}$ cannot be in a minimizer of $F$, while
those indices $k$ such that $(s_t)_k < -\sqrt{2\varepsilon}$ have to be in a minimizer, which
can allow efficient reduction of the search space (although these have
not been implemented in the simulations in Section 9).

Alternative algorithms for the same separable optimization
problems may be used, i.e., conditional gradient without line search [72, 10],
with similar convergence rates and behavior, but sometimes worse
empirical performance, and with a weaker link with the
minimum-norm-point algorithm. Another alternative is to consider projected
subgradient descent in $w$, with the same convergence rate (because the
objective function is then strongly convex). Note that, as shown before
(Section 6.3), it is equivalent to a conditional gradient algorithm with
no line search.

Smoothing for special case of submodular functions. For some
specific submodular functions, it is possible to use alternative
optimization algorithms. As outlined in [130], this is appropriate when $F$ may
be written as $F(A) = \sum_{G \in S} F_G(A \cap G)$, where $F_G : 2^G \to \mathbb{R}$ is
submodular, the set $S$ of subsets of $V$ is composed of small subsets, and
the Lovasz extensions $f_G$ are explicit enough so that one may compute
a convex smooth (with Lipschitz-constant of the gradient less than $L$)
approximation of $f_G$ with uniform approximation error of $O(1/L)$. In
this situation, the Lovasz extension of $F$ may be approximated within
$O(1/L)$ by a smooth function on which an accelerated gradient
technique such as described in Section 5.1 may be used with convergence
rate $O(L/t^2)$ after $t$ iterations. When choosing $L = t$ (thus with a
fixed horizon), this leads to an approximation guarantee for submodular
function minimization of the form $O(1/t)$, instead of $O(1/\sqrt{t})$ in the
general case.
Other submodular optimization problems



While submodular function minimization may be solved in polynomial
time (see Section 7), submodular function maximization (which
includes the maximum cut problem) is NP-hard. Nevertheless,
submodularity may be used in order to obtain some local or global guarantees
(see Section 8.1 and Section 8.2) or to derive local descent algorithms
for more general problems (see Section 8.3).

8.1 Submodular function maximization 

In this section, we consider a submodular function and the
maximization problem:

$$\max_{A \subseteq V} F(A). \qquad (8.1)$$

This problem is known to be NP-hard (note that it includes the max- 
imum cut problem) [IS]. However, several approximation algorithms 
exist with theoretical guarantees, in particular when the function is
known to be non-negative (i.e., with non-negative values $F(A)$ for all
$A \subseteq V$). For example, it is shown in [36] that selecting a random subset
already achieves at least 1/4 of the optimal value¹, while local search
techniques achieve at least 1/2 of the optimal value.

Local search algorithm. Given any set $A$, simple local search
algorithms simply consider all sets of the form $A \cup \{k\}$ and $A \setminus \{k\}$ and
select the one with largest value of $F$. If this value is lower than $F(A)$,
then the algorithm stops and we are by definition at a local maximum.
While these local maxima do not lead to any global guarantees in
general, there is an interesting added guarantee based on submodularity,
which we now prove (see more details in [53]).



Proposition 8.1. (Local maxima for submodular function
maximization) Let $F$ be a submodular function and $A \subseteq V$ such that for
all $k \in A$, $F(A \setminus \{k\}) \leq F(A)$ and for all $k \in V \setminus A$, $F(A \cup \{k\}) \leq F(A)$.
Then for all $B \subseteq A$ and all $B \supseteq A$, $F(B) \leq F(A)$.



Proof. If $B = A \cup \{i_1, \dots, i_q\}$, then

$$F(B) - F(A) = \sum_{j=1}^q F(A \cup \{i_1, \dots, i_j\}) - F(A \cup \{i_1, \dots, i_{j-1}\}) \leq \sum_{j=1}^q F(A \cup \{i_j\}) - F(A) \leq 0,$$

which leads to the result for supersets of $A$. The result for subsets may be
obtained from the previous one applied to $A \mapsto F(V \setminus A) - F(V)$. □
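The local search procedure and the guarantee of Prop. 8.1 can be illustrated on a small cut function. The sketch below uses hypothetical names (`local_search_max`, `WEIGHTS`, `cut`, `dominates_sub_and_supersets`) and checks the proposition by exhaustive enumeration, which is only feasible for tiny base sets.

```python
from itertools import combinations

def local_search_max(F, V, A=frozenset()):
    """Move to the best neighbor A + k or A - k until no neighbor strictly improves F."""
    V, A = frozenset(V), frozenset(A)
    while True:
        neighbors = [A | {k} for k in V - A] + [A - {k} for k in A]
        best = max(neighbors, key=F)
        if F(best) <= F(A):      # no strict improvement: A is a local maximum
            return A
        A = best

# Hypothetical weighted undirected graph; cut functions are submodular.
WEIGHTS = {(0, 1): 3, (0, 2): 1, (1, 2): 3, (2, 3): 1, (3, 4): 4, (1, 4): 1}

def cut(A):
    return sum(w for (i, j), w in WEIGHTS.items() if (i in A) != (j in A))

def dominates_sub_and_supersets(F, V, A):
    """Brute-force check of Prop. 8.1: F(B) <= F(A) for all B inside or containing A."""
    V = sorted(V)
    subsets = (frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r))
    return all(F(B) <= F(A) for B in subsets if B <= A or B >= A)
```

The loop terminates because the value of $F$ strictly increases at every move and there are finitely many subsets.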

Note that branch-and-bound algorithms (with worst-case exponential
time complexity) may be designed that specifically take advantage of
the property above.

Formulation using base polyhedron. Given $F$ and its Lovasz
extension $f$, we have (the first equality is true since maximization of a
convex function leads to an extreme point [121]):

$$\max_{A \subseteq V} F(A) = \max_{w \in [0,1]^p} f(w)$$

$$= \max_{w \in [0,1]^p} \max_{s \in B(F)} w^\top s \quad \text{because of Prop. 2.2,}$$

$$= \max_{s \in B(F)} s_+(V) = \max_{s \in B(F)} \frac{1}{2} (s + |s|)(V)$$

$$= \frac{1}{2} F(V) + \frac{1}{2} \max_{s \in B(F)} \|s\|_1.$$

Thus submodular function maximization may be seen as finding the
maximum $\ell_1$-norm point in the base polyhedron (which is not a convex
optimization problem). See an illustration in Figure 8.1.

¹ Such a result for a random subset shows that having theoretical guarantees does not
necessarily imply that an algorithm is doing anything subtle.

8.2 Submodular function maximization with cardinality constraints

In this section, we consider a specific instance of submodular
maximization problems, with theoretical guarantees.

Greedy algorithm for non-decreasing functions. Submodular
function maximization provides a classical example where greedy
algorithms do have performance guarantees. We now consider a
non-decreasing submodular function $F$ and the problem of maximizing $F(A)$
subject to the constraint $|A| \leq k$, for a certain $k$. The greedy
algorithm will start with the empty set $A = \emptyset$ and iteratively add the
element $i \in V \setminus A$ such that $F(A \cup \{i\}) - F(A)$ is maximal. It has a
$(1 - 1/e)$-performance guarantee [111] (note that this guarantee cannot
be improved in general, as it cannot for set cover, see [44]):

Proposition 8.2. (Performance guarantee for submodular
function maximization) Let $F$ be a non-decreasing submodular
function. The greedy algorithm for maximizing $F(A)$ subject to $|A| \leq k$
outputs a set $A$ such that

$$F(A) \geq \big[ 1 - (1 - 1/k)^k \big] \max_{B \subseteq V,\ |B| \leq k} F(B) \geq (1 - 1/e) \max_{B \subseteq V,\ |B| \leq k} F(B).$$

Proof. We follow the proof of [111, 136]. Let $A^* = \{b_1, \dots, b_k\}$ be
a maximizer of $F$ with $k$ elements, and $a_j$ the $j$-th element selected
during the greedy algorithm. We consider $\rho_j = F(\{a_1, \dots, a_j\}) - F(\{a_1, \dots, a_{j-1}\})$
and denote $A_{j-1} = \{a_1, \dots, a_{j-1}\}$. We have for all $j \in \{1, \dots, k\}$:

$$F(A^*) \leq F(A^* \cup A_{j-1}) \quad \text{because } F \text{ is non-decreasing,}$$

$$= F(A_{j-1}) + \sum_{i=1}^k \big[ F(A_{j-1} \cup \{b_1, \dots, b_i\}) - F(A_{j-1} \cup \{b_1, \dots, b_{i-1}\}) \big]$$

$$\leq F(A_{j-1}) + \sum_{i=1}^k \big[ F(A_{j-1} \cup \{b_i\}) - F(A_{j-1}) \big] \quad \text{by submodularity,}$$

$$\leq F(A_{j-1}) + k \rho_j \quad \text{by definition of the greedy algorithm,}$$

$$= \sum_{i=1}^{j-1} \rho_i + k \rho_j.$$

We can now simply minimize $\sum_{i=1}^k \rho_i$ subject to the $k$ constraints
defined above (plus pointwise positivity), i.e., $\sum_{i=1}^{j-1} \rho_i + k \rho_j \geq F(A^*)$. It
turns out that taking all inequalities as equalities leads to an invertible
linear system whose solution is $\rho_j = (k-1)^{j-1} k^{-j} F(A^*) \geq 0$, leading to
$\sum_{i=1}^k \rho_i = \sum_{i=1}^k (1 - 1/k)^{i-1} k^{-1} F(A^*) = [1 - (1 - 1/k)^k] F(A^*)$, hence the desired result
since $(1 - 1/k)^k = \exp(k \log(1 - 1/k)) \leq \exp(k \times (-1/k)) = 1/e$. □
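The greedy algorithm of Prop. 8.2 is short enough to be sketched directly. In the example below, the coverage function built from the hypothetical dictionary `COVER` is non-decreasing and submodular, and `brute_force_max` is only there to compare against the optimum on a tiny instance.

```python
from itertools import combinations

def greedy_max(F, V, k):
    """Greedy selection of at most k elements maximizing the marginal gain at each step."""
    V, A = frozenset(V), frozenset()
    for _ in range(min(k, len(V))):
        best = max(V - A, key=lambda i: F(A | {i}) - F(A))
        A = A | {best}
    return A

# Hypothetical set-cover instance: element i of V covers the items COVER[i].
COVER = {0: {1, 2, 3, 4}, 1: {1, 2, 5}, 2: {3, 4, 6}, 3: {5, 6, 7}, 4: {7, 8}}

def coverage(A):
    """Number of items covered by the elements of A (non-decreasing submodular)."""
    return len(set().union(*(COVER[i] for i in A)))

def brute_force_max(F, V, k):
    """Exact maximum of F over subsets of size at most k (tiny instances only)."""
    V = sorted(V)
    return max(F(frozenset(c)) for r in range(k + 1) for c in combinations(V, r))
```

On such instances the ratio between the greedy value and the optimum can be compared with the factor $1 - (1 - 1/k)^k$ from the proposition.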

Extensions. Given the previous result on cardinality constraints, 
several extensions have been considered, such as knapsack constraints 
or matroid constraints (see [23] and references therein). Moreover, 
fast algorithms and online data-dependent bounds can be further de- 
rived [Ml- 

8.3 Difference of submodular functions 

In regular continuous optimization, differences of convex functions
play an important role, and appear in various disguises, such as
DC-programming [67], concave-convex procedures [138], or
majorization-minimization algorithms [69]. They allow the expression of any
continuous optimization problem with natural descent algorithms based on
upper-bounding a concave function by its tangents.

In the context of combinatorial optimization, [106] has shown that
a similar situation holds for differences of submodular functions. We
now review these properties.



Formulation of any combinatorial optimization problem. Let
$F : 2^V \to \mathbb{R}$ be any set-function, and $H$ a strictly submodular function,
i.e., a function such that

$$\alpha = \min_{A \subset V} \ \min_{i,j \in V \setminus A} \ -H(A \cup \{i,j\}) + H(A \cup \{i\}) + H(A \cup \{j\}) - H(A) > 0.$$

A typical example would be $H(A) = -\frac{1}{2}|A|^2$, where $\alpha = 1$. If

$$\beta = \min_{A \subset V} \ \min_{i,j \in V \setminus A} \ -F(A \cup \{i,j\}) + F(A \cup \{i\}) + F(A \cup \{j\}) - F(A)$$

is non-negative, then $F$ is submodular (see Prop. 1.2). If $\beta < 0$, then
$F(A) - \frac{\beta}{\alpha} H(A)$ is submodular, and thus we have
$F(A) = [F(A) - \frac{\beta}{\alpha} H(A)] - [-\frac{\beta}{\alpha} H(A)]$, which is a difference of two submodular
functions. Thus any combinatorial optimization problem may be seen as a
difference of submodular functions (with, of course, a non-unique
decomposition). However, some problems, such as subset selection in
Section 3.7, or more generally discriminative learning of graphical model
structure, may naturally be seen as such [106].
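The decomposition can be illustrated by brute force on a small base set. In the sketch below, the set-function `F` is a hypothetical non-submodular example, `min_second_difference` is an illustrative helper, and the minimum is taken over pairs of distinct elements $i \neq j$ outside $A$ (an assumption about how the definition is meant to be read).

```python
from itertools import combinations

def subsets(V):
    """All subsets of V as frozensets."""
    for r in range(len(V) + 1):
        for A in combinations(V, r):
            yield frozenset(A)

def min_second_difference(F, V):
    """Minimum over A and distinct i, j outside A of
    -F(A+i+j) + F(A+i) + F(A+j) - F(A)."""
    best = float("inf")
    for A in subsets(V):
        rest = [k for k in V if k not in A]
        for i, j in combinations(rest, 2):
            d = -F(A | {i, j}) + F(A | {i}) + F(A | {j}) - F(A)
            best = min(best, d)
    return best

V = [0, 1, 2, 3]

def H(A):
    """Strictly submodular reference function, H(A) = -|A|^2 / 2."""
    return -0.5 * len(A) ** 2

def F(A):
    """Hypothetical set-function that is not submodular."""
    return float(len(A) ** 2 - 3 * len(A) + (1 if 0 in A else 0))

alpha = min_second_difference(H, V)
beta = min_second_difference(F, V)

def F1(A):
    """First submodular term, F - (beta / alpha) H."""
    return F(A) - (beta / alpha) * H(A)

def F2(A):
    """Second submodular term, -(beta / alpha) H, so that F = F1 - F2."""
    return -(beta / alpha) * H(A)
```

Here `beta` is negative, and both `F1` and `F2` have non-negative second-order differences, which is the property used in the text.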

Optimization algorithms. Given two submodular set-functions F 
and G, we consider the following iterative algorithm, starting from a 
subset A: 

(1) Compute a modular lower bound $B \mapsto s(B)$ of $G$ which is
tight at $A$: this might be done by using the greedy algorithm
of Prop. [77^ with $w = 1_A$. Several orderings of components
of $w$ may be used (see [106] for more details).

(2) Take $A$ as any minimizer of $B \mapsto F(B) - s(B)$, using any
algorithm of Section 7.



It converges to a local minimum, in the sense that at convergence to a
set $A$, all sets $A \cup \{k\}$ and $A \setminus \{k\}$ have function values no smaller than that of $A$.
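The two steps can be sketched as follows for minimizing $F - G$ on a small base set. This is a minimal illustration under stated assumptions: step (2) is replaced by exhaustive search (standing in for a submodular minimization algorithm), only one ordering is tried in step (1), and the functions `F`, `G` and the weights `c` are hypothetical. With a single ordering the sketch only guarantees that the objective never increases; the local-minimum property discussed above relies on trying several orderings.

```python
from itertools import combinations
import math

def subsets(V):
    """All subsets of V as frozensets."""
    for r in range(len(V) + 1):
        for A in combinations(V, r):
            yield frozenset(A)

def modular_lower_bound(G, V, A):
    """Greedy algorithm with the elements of A ordered first.
    Returns s with s(B) <= G(B) for all B and s(A) = G(A)."""
    order = [k for k in V if k in A] + [k for k in V if k not in A]
    s, prefix = {}, frozenset()
    for k in order:
        s[k] = G(prefix | {k}) - G(prefix)
        prefix = prefix | {k}
    return s

def minimize_difference(F, G, V, A):
    """Iterate steps (1) and (2) until F - G stops decreasing."""
    A = frozenset(A)
    values = [F(A) - G(A)]
    while True:
        s = modular_lower_bound(G, V, A)                       # step (1)
        B = min(subsets(V),                                    # step (2), brute force
                key=lambda S: F(S) - sum(s[k] for k in S))
        if F(B) - G(B) >= values[-1] - 1e-12:
            return A, values
        A = B
        values.append(F(A) - G(A))

# Hypothetical submodular functions on V = {0, ..., 4}.
V = [0, 1, 2, 3, 4]
c = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0}

def F(A):
    return 4.0 * math.sqrt(len(A))

def G(A):
    return 3.0 * min(sum(c[k] for k in A), 7.0)

A_final, values = minimize_difference(F, G, V, frozenset())
```

The list `values` records the objective $F(A) - G(A)$ along the iterations and is strictly decreasing by construction.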

Fig. 8.1: Geometric interpretation of submodular function maximiza- 
tion (left) and optimization of differences of submodular functions 
(right). See text for details. 



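Formulation using base polyhedron. We can give a geometric
interpretation similar to the one for submodular function maximization;
given $F$, $G$ and their Lovász extensions $f$, $g$, we have:

$$\begin{aligned}
\min_{A \subset V} F(A) - G(A)
&= \min_{A \subset V} \ \min_{s \in B(G)} F(A) - s(A) \quad \text{because } G(A) = \max_{s \in B(G)} s(A) \\
&= \min_{w \in [0,1]^p} \ \min_{s \in B(G)} f(w) - s^\top w \quad \text{because of Prop. 2.4} \\
&= \min_{s \in B(G)} \ \min_{w \in [0,1]^p} f(w) - s^\top w \\
&= \min_{s \in B(G)} \ \min_{w \in [0,1]^p} \ \max_{t \in B(F)} t^\top w - s^\top w \\
&= \min_{s \in B(G)} \ \max_{t \in B(F)} \ \min_{w \in [0,1]^p} t^\top w - s^\top w \quad \text{by strong duality} \\
&= \min_{s \in B(G)} \ \max_{t \in B(F)} (t - s)_-(V) \\
&= \frac{F(V) - G(V)}{2} - \frac{1}{2} \max_{s \in B(G)} \ \min_{t \in B(F)} \|t - s\|_1 .
\end{aligned}$$

The last line uses $(t - s)_-(V) = \frac{1}{2}(t - s)(V) - \frac{1}{2}\|t - s\|_1$, with $t(V) = F(V)$ and $s(V) = G(V)$.
Thus optimization of the difference of submodular functions may be
seen as computing the Hausdorff distance (see, e.g., jlUl| ) between
$B(G)$ and $B(F)$. See an illustration in Figure 8.1.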
Experiments



In this section, we provide illustrations of the optimization
algorithms described earlier, for submodular function minimization
(Section 9.1), as well as for convex optimization problems, quadratic
separable ones such as the ones used for proximal methods or within
submodular function minimization (Section 9.2), and an application
of sparsity-inducing norms to wavelet-based estimators (Section 9.3).
The Matlab code for all these experiments may be found
at http://www.di.ens.fr/~fbach/submodular/.

9.1 Submodular function minimization 

Fig. 9.1: Examples of semi-supervised clustering: (left) observations,
(right) results of the semi-supervised clustering algorithm based on
submodular function minimization, with eight labelled data points.

We compare several simple though effective approaches to submodular
function minimization described in Section 7, namely:

• min-norm-point: the minimum-norm-point algorithm to
maximize $-\frac{1}{2}\|s\|_2^2$ over $s \in B(F)$, described in Section 7.2.

• subgrad-des: the projected subgradient descent algorithm to
minimize $f(w)$ over $w \in [0,1]^p$, described in Section 7.5.

• cond-grad: the conditional gradient algorithm to maximize
$-\frac{1}{2}\|s\|_2^2$ over $s \in B(F)$, with line search, described in
Section 7.5.

• cond-grad-1/t: the conditional gradient algorithm to
maximize $-\frac{1}{2}\|s\|_2^2$ over $s \in B(F)$, with step size $1/t$, described in
Section 7.5.

• cond-grad-w: the conditional gradient algorithm to maxi- 
mize — Diag(a)~^s over s S B{F), with line search. 

From all these algorithms, we look for the sub-level sets of $s$ to obtain
the best value for the set-function $F$. We also use the base $s \in B(F)$
as a certificate for optimality, through the gap $F(A) - s_-(V)$ (see Prop. 7.3).
We test these algorithms on three data sets:

• Two moons (clustering with mutual information criterion):
we generated data from a standard synthetic example in
semi-supervised learning (see Figure 9.1) with $p = 400$ data
points, and 16 labelled data points, using the method
presented in Section 3.5 (based on the mutual information
between two Gaussian processes), with a Gaussian-RBF kernel.

• Genrmf-wide and Genrmf-long (min-cut/max-flow
standard benchmark): following [50], we generated cut problems
using the generator GENRMF available from the DIMACS
challenge¹. Two types of network were generated, "long" and
"wide", with respectively $p = 575$ vertices and 2390 edges,
and $p = 430$ vertices and 1872 edges (see [50] for more details).

In Figures 9.2, 9.4 and 9.6, we compare the five algorithms on the
three datasets. We denote by Opt the optimal value of the optimization
problem, i.e., $\mathrm{Opt} = \min_{w \in [0,1]^p} f(w) = \max_{s \in B(F)} s_-(V)$. On the
left plots, we display the dual suboptimality, i.e., $\log_{10}(\mathrm{Opt} - s_-(V))$,
together with the certified duality gap (in dashed). In the right plots
we display the primal suboptimality $\log_{10}(F(B) - \mathrm{Opt})$. Note that in
all the plots in Figures 9.2, 9.3, 9.4, 9.5, 9.6 and 9.7, we plot the best
values achieved so far, i.e., we make all curves non-increasing.

Since all algorithms perform a sequence of greedy algorithms (for
finding maximum weight bases), we replace running times by numbers
of iterations². On all datasets, the achieved primal function values
are in fact much lower than the certified values, a situation common
in convex optimization, while this is not the case for dual values.
Thus primal values $F(A)$ are quickly very good and iterations are
just needed to sharpen the certificate of optimality. On all datasets,
the min-norm-point algorithm reached small duality gaps the fastest.
On all datasets, among the three conditional gradient algorithms, the 
weighted one (with weights = l/ofc) performs slightly better than 
the unweighted one, and these two versions with line-search perform 
significantly better than the algorithm with decaying step sizes. Finally, 
the direct approach based on subgradient descent performs worse in the 
two graph-cut examples, in particular in terms of certified duality gaps. 
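To make the comparison concrete, the conditional gradient variant with line search can be sketched on a toy problem. This is not the Matlab code used for the experiments: the graph, the weights `u`, and the function names are hypothetical, and the problem is small enough to be checked by exhaustive search. The sketch minimizes $\frac{1}{2}\|s\|_2^2$ over $B(F)$, reads candidate sets from the sub-level sets of $s$, and reports $s_-(V)$ as a lower bound.

```python
from itertools import combinations

def greedy_base(F, w):
    """Greedy algorithm: a base s in B(F) maximizing w^T s."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    s, prefix, prev = [0.0] * len(w), [], 0.0
    for k in order:
        prefix.append(k)
        cur = F(frozenset(prefix))
        s[k] = cur - prev
        prev = cur
    return s

def cond_grad_sfm(F, p, iters=200):
    """Conditional gradient with exact line search on min 0.5*||s||^2 over B(F)."""
    s = greedy_base(F, [0.0] * p)
    best_set, best_val = frozenset(), F(frozenset())
    for _ in range(iters):
        t = greedy_base(F, [-x for x in s])           # argmin over B(F) of s^T t
        d = [ti - si for ti, si in zip(t, s)]
        dd = sum(x * x for x in d)
        if dd < 1e-18:
            break
        step = min(1.0, max(0.0, -sum(a * b for a, b in zip(s, d)) / dd))
        s = [si + step * di for si, di in zip(s, d)]
        order = sorted(range(p), key=lambda k: s[k])   # sub-level sets of s
        for m in range(p + 1):
            A = frozenset(order[:m])
            if F(A) < best_val:
                best_set, best_val = A, F(A)
    dual = sum(min(x, 0.0) for x in s)                 # s_-(V), a lower bound on min F
    return s, best_set, best_val, dual

# Hypothetical instance: graph cut plus a modular term, with F(empty set) = 0.
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5, (3, 4): 1.5, (0, 4): 1.0, (1, 3): 0.7}
u = [1.0, -2.0, 0.5, -1.5, 1.0]

def F(A):
    cut = sum(w for (i, j), w in edges.items() if (i in A) != (j in A))
    return cut + sum(u[k] for k in A)

p = 5
s, best_set, best_val, dual = cond_grad_sfm(F, p)
```

The difference `best_val - dual` is the certified duality gap; it is always non-negative because every base $s$ satisfies $s_-(V) \leq F(A)$ for all sets $A$.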

¹ The First DIMACS international algorithm implementation challenge:
The core experiments (1990), available at
ftp://dimacs.rutgers.edu/pub/netflow/generalinfo/core.tex

² Only the minimum-norm-point algorithm has a non-trivial cost per iteration, and in our
experiments, plots with running times would not be significantly different.

9.2 Separable optimization problems

In this section, we compare the iterative algorithms outlined in
Section 6 for minimization of quadratic separable optimization problems,
on the problems related to submodular function minimization from the
previous section (i.e., minimizing $f(w) + \frac{1}{2}\|w\|_2^2$). In Figures 9.3, 9.5
and 9.7, we compare three algorithms on the three datasets, namely the
minimum-norm-point algorithm, and two versions of conditional gradient
(with and without line search). On the left plots, we display the
achieved quantity $\log_{10}\big( f(w) + \frac{1}{2}\|w\|_2^2 - \min_{v \in \mathbb{R}^p} [ f(v) + \frac{1}{2}\|v\|_2^2 ] \big)$, while in
the right plots we display the logarithm of the certified duality gaps, for
the same algorithms. Since all algorithms perform a sequence of greedy
algorithms (for finding maximum weight bases), we replace running
times by numbers of iterations. As in Section 9.1, on all datasets, the
achieved primal function values are in fact much lower than the
certified values, a situation common in convex optimization. On all datasets,
the min-norm-point algorithm reached small duality gaps the fastest. On
all datasets, among the two conditional gradient algorithms, the
version with line search performs significantly better than the algorithm
with decaying step sizes. Note also that, while the conditional gradient
algorithm is not finitely convergent, its performance is not much
worse than that of the minimum-norm-point algorithm, with a smaller running
time complexity per iteration. Moreover, as shown on the right plots,
the "pool-adjacent-violator" correction is crucial in obtaining much
improved primal candidates.

Fig. 9.2: Submodular function minimization results for "Genrmf-wide"
example: (left) optimal value minus dual function values in log-scale
vs. number of iterations, in dashed, certified duality gap in log-scale
vs. number of iterations. (Right) Primal function values minus optimal
value in log-scale vs. number of iterations.

Fig. 9.3: Separable optimization problem for "Genrmf-wide" example.
(Left) optimal value minus dual function values in log-scale vs. number
of iterations, in dashed, certified duality gap in log-scale vs. number of
iterations. (Right) Primal function values minus optimal value in
log-scale vs. number of iterations, in dashed, before the
"pool-adjacent-violator" correction.

Fig. 9.4: Submodular function minimization results for "Genrmf-long"
example: (left) optimal value minus dual function values in log-scale
vs. number of iterations, in dashed, certified duality gap in log-scale
vs. number of iterations. (Right) Primal function values minus optimal
value in log-scale vs. number of iterations.

Fig. 9.5: Separable optimization problem for "Genrmf-long" example.
(Left) optimal value minus dual function values in log-scale vs. number
of iterations, in dashed, certified duality gap in log-scale vs. number of
iterations. (Right) Primal function values minus optimal value in
log-scale vs. number of iterations, in dashed, before the
"pool-adjacent-violator" correction.

Fig. 9.6: Submodular function minimization results for "Two moons"
example: (left) optimal value minus dual function values in log-scale
vs. number of iterations, in dashed, certified duality gap in log-scale
vs. number of iterations. (Right) Primal function values minus optimal
value in log-scale vs. number of iterations.

Fig. 9.7: Separable optimization problem for "Two moons" example.
(Left) optimal value minus dual function values in log-scale vs. number
of iterations, in dashed, certified duality gap in log-scale vs. number of
iterations. (Right) Primal function values minus optimal value in
log-scale vs. number of iterations, in dashed, before the
"pool-adjacent-violator" correction.

9.3 Regularized least-squares estimation 

In this section, we illustrate the use of the Lovász extension in the
context of sparsity-inducing norms detailed in Section 2.3, with the
submodular function defined in Figure 6.5, which is based on a tree
structure among the $p$ variables, and encourages variables to be selected
after their ancestors. We do not use any weights, and thus $F(A)$ is equal
to the cardinality of the union of all ancestors $\mathrm{Anc}(A)$ of nodes indexed
by elements of $A$.

Given a probability distribution $(x, y)$ on $[0,1] \times \mathbb{R}$, we aim to
estimate $f(x) = \mathbb{E}(Y | X = x)$ by a piecewise constant function.
Following [140], we consider a Haar wavelet estimator with
maximal depth $d$. That is, given the Haar wavelet, defined on $\mathbb{R}$ as
$\psi(t) = 1_{[0,1/2)}(t) - 1_{[1/2,1)}(t)$, we consider the functions $\psi_{ij}(t)$ defined
as $\psi_{ij}(t) = \psi(2^{i-1} t - j)$, for $i = 1, \dots, d$ and $j \in \{0, \dots, 2^{i-1} - 1\}$,
leading to $p = 2^d - 1$ basis functions. These functions come naturally
in a binary tree structure, as shown in Figure 9.8 for $d = 3$. Imposing
a tree-structured prior enforces that a wavelet with given support
is selected only after all larger supports are selected; this avoids the
selection of isolated wavelets with small supports.
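The wavelet tree and the associated set-function can be sketched as follows. The indexing of the tree (parent of node $(i, j)$ taken to be $(i-1, \lfloor j/2 \rfloor)$) and the convention that $\mathrm{Anc}(A)$ contains the nodes of $A$ themselves are assumptions made for this illustration; the helper names are hypothetical.

```python
from itertools import combinations

def haar(t):
    """Mother Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1:
        return -1.0
    return 0.0

def haar_basis(d):
    """Index pairs (i, j) with i = 1..d and j = 0..2^(i-1) - 1."""
    return [(i, j) for i in range(1, d + 1) for j in range(2 ** (i - 1))]

def psi(i, j, t):
    """Basis function psi_ij(t) = psi(2^(i-1) t - j)."""
    return haar(2 ** (i - 1) * t - j)

def ancestors(node):
    """The node and all its ancestors; the support of a parent contains that of its child."""
    i, j = node
    out = set()
    while i >= 1:
        out.add((i, j))
        i, j = i - 1, j // 2
    return out

def F(A):
    """F(A) = Card(Anc(A)), the number of nodes in the union of ancestor sets."""
    anc = set()
    for node in A:
        anc |= ancestors(node)
    return len(anc)

d = 3
basis = haar_basis(d)
p = len(basis)
```

With $d = 3$ this gives $p = 7$ basis functions, and selecting the single leaf $(3, 2)$ costs $F = 3$ because its two ancestors are counted as well.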

We consider random inputs $x_k \in [0,1]$, $k = 1, \dots, n$, from a uniform
distribution and compute $y_k = \sin(20 \pi x_k^2) + \varepsilon_k$, where $\varepsilon_k$ is Gaussian
with mean zero and standard deviation 0.1. We consider the
optimization problem

$$\min_{w \in \mathbb{R}^p, \ b \in \mathbb{R}} \ \frac{1}{2n} \sum_{k=1}^n \Big( y_k - \sum_{i=1}^d \sum_{j=0}^{2^{i-1}-1} w_{ij} \psi_{ij}(x_k) - b \Big)^2 + \lambda R(w), \qquad (9.1)$$

where $b$ is a constant term and $R(w)$ is a regularization function. In
Figure 9.9, we compare several regularization terms, namely
$R(w) = \frac{1}{2}\|w\|_2^2$ (ridge regression), $R(w) = \|w\|_1$ (Lasso) and $R(w) = \Omega(w)$
defined from the hierarchical submodular function $F(A) = \mathrm{Card}(\mathrm{Anc}(A))$.
For all of these, we select $\lambda$ such that the generalization
performance is maximized, and compare the estimated functions. The
hierarchical prior leads to a lower estimation error with fewer artefacts.
Fig. 9.8: Wavelet binary tree ($d = 3$). See text for details.

Fig. 9.9: Estimation with wavelet trees: (left) ridge regression, (middle)
Lasso, (right) hierarchical penalty. See text for details.



In this section, our goal is mainly to compare several optimization 
schemes to minimize Eq. (19. ip for this particular example (for more sim- 
ulations on other examples with similar conclusions, see [SI [Ml EH E] ) • 
We compare in Figure 9.10 three ways of computing the proximal
operator (within a proximal gradient method) and one direct optimization
scheme based on subgradient descent:

• Prox-hierarchical: we use a dedicated proximal operator
based on the composition of local proximal operators [77].

• Prox-decomposition: we use the algorithm of Section 6.1,
which uses the fact that for any vector $t$, $F - t$ may be
minimized by dynamic programming [77].

• Prox-min-norm-point: we use the generic method which
does not use any of the structure.

• subgrad-descent: we use a generic method which does not
use any of the structure, and minimize directly Eq. (9.1) by
subgradient descent.

Fig. 9.10: Running times for convex optimization for a regularized
problem: several methods are compared; see text for details.

As expected, in Figure 9.10, we see that the most efficient algorithm
is the dedicated proximal algorithm (which is usually not available
except in particular cases like the tree-structured norm), while the
methods based on submodular functions perform reasonably well, with an advantage
for methods using the structure (i.e., the decomposition method, which
is only applicable when submodular function minimization is efficient)
over the generic method based on the min-norm-point algorithm (which
is always applicable).



Conclusion 



In this paper, we have explored various properties and applications of
submodular functions. Key concepts are the Lovász extension and the
associated submodular and base polyhedra. Given the numerous
examples involving such functions, the analysis and algorithms presented
in this paper allow the unification of several results in convex
optimization, involving structured situations and notably sparsity-inducing
norms. Several questions related to submodular functions remain open,
such as efficient combinatorial optimization algorithms for submodular
function minimization, with both good computational complexity
bounds and practical performance. Moreover, we have presented
algorithms for approximate submodular function minimization with
convergence rate of the form $O(1/\sqrt{t})$, where $t$ is the number of calls to
the greedy algorithm; it would be interesting to obtain better rates or
show that this rate is optimal. Finally, submodular functions
essentially consider links between combinatorial optimization problems and
linear programming, or linearly constrained quadratic programming;
it would be interesting to extend submodular analysis so that more
modern convex optimization tools, such as semidefinite programming, can be used.
A



Review of convex analysis and optimization 



In this section, we review relevant concepts from convex analysis and 
optimization. For more details, see [17 1 fl^ l [T6 t 1121] , 

A.1 Convex analysis

In this section, we review extended-value convex functions, Fenchel 
conjugates and polar sets. 

Extended-value convex functions. In this paper, we consider
functions defined on $\mathbb{R}^p$ with values in $\mathbb{R} \cup \{+\infty\}$, and the domain
of $f$ is defined to be the set of vectors in $\mathbb{R}^p$ such that $f$ has finite
values. Such an "extended-value" function is said to be convex if its
domain is convex and $f$ restricted to its domain (which is a real-valued
function) is convex.

Throughout this paper, we denote by $w \mapsto I_C(w)$ the indicator function
of the convex set $C$, defined as $0$ for $w \in C$ and $+\infty$ otherwise; this
defines a convex function and allows constrained optimization problems
to be treated as unconstrained optimization problems. In this paper, we
always assume that $f$ is a proper function (i.e., has non-empty domain).
A function is said to be closed if for all $\alpha \in \mathbb{R}$, the set $\{w \in \mathbb{R}^p, \ f(w) \leq \alpha\}$
is a closed set. We only consider closed functions in this paper.

Fenchel conjugate. For any function $f : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$, we may
define the Fenchel conjugate $f^*$ as the extended-value function from
$\mathbb{R}^p$ to $\mathbb{R} \cup \{+\infty\}$ defined as

$$f^*(s) = \sup_{w \in \mathbb{R}^p} w^\top s - f(w). \qquad (A.1)$$

As a pointwise supremum of linear functions, $f^*$ is always convex
(even if $f$ is not), and it is always closed. Moreover, if $f$ is convex and
closed, then the biconjugate of $f$ (i.e., $f^{**}$) is equal to $f$, i.e., for all
$w \in \mathbb{R}^p$,

$$f(w) = \sup_{s \in \mathbb{R}^p} w^\top s - f^*(s).$$

If $f$ is not convex and closed, then the biconjugate is always a lower
bound on $f$, i.e., for all $w \in \mathbb{R}^p$, $f^{**}(w) \leq f(w)$, and it is the tightest
such convex closed lower bound, often referred to as the convex envelope
(see an example in Section 2.3).

When $f$ is convex and closed, many properties of $f$ may be seen
from $f^*$ and vice-versa:

• $f$ is strictly convex if and only if $f^*$ is differentiable in the
interior of its domain,

• $f$ is $\mu$-strongly convex (i.e., the function $w \mapsto f(w) - \frac{\mu}{2}\|w\|_2^2$
is convex) if and only if $f^*$ has Lipschitz-continuous gradients
(with constant $1/\mu$) in the interior of its domain.

Support function. Given a convex closed set $C$, the support
function of $C$ is the Fenchel conjugate of $I_C$, defined as:

$$\forall s \in \mathbb{R}^p, \quad I_C^*(s) = \sup_{w \in C} w^\top s.$$

It is always a positively homogeneous proper closed convex function.
Moreover, if $f$ is a positively homogeneous proper closed convex
function, then $f^*$ is the indicator function of a closed convex set.
Proximal problems and duality. In this paper, we will consider
minimization problems of the form

$$\min_{w \in \mathbb{R}^p} \frac{1}{2}\|w - z\|_2^2 + f(w),$$

where $f$ is a positively homogeneous proper closed convex function
(with $C$ being a convex closed set such that $f^* = I_C$). We then have

$$\begin{aligned}
\min_{w \in \mathbb{R}^p} \frac{1}{2}\|w - z\|_2^2 + f(w)
&= \min_{w \in \mathbb{R}^p} \ \max_{s \in C} \ \frac{1}{2}\|w - z\|_2^2 + w^\top s \\
&= \max_{s \in C} \ \min_{w \in \mathbb{R}^p} \ \frac{1}{2}\|w - z\|_2^2 + w^\top s \\
&= \max_{s \in C} \ \frac{1}{2}\|z\|_2^2 - \frac{1}{2}\|s - z\|_2^2,
\end{aligned}$$

where the unique minima of the two problems are related through
$w = z - s$ (the inner minimization over $w$ is attained at $w = z - s$). Note
that the inversion of the maximum and minimum was
made possible because strong duality holds in this situation ($f$ has
domain equal to $\mathbb{R}^p$). Thus the original problem is equivalent to an
orthogonal projection on $C$. See applications and extensions to more
general separable functions (beyond quadratic) in Section 6.
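As a worked instance, take $f(w) = \lambda \|w\|_1$, for which $C$ is the $\ell_\infty$-ball of radius $\lambda$. The sketch below (an illustration with hypothetical helper names and data) checks that the proximal solution is $z$ minus the projection of $z$ onto $C$, that it coincides with soft-thresholding, and that the primal and dual values agree.

```python
import math
import random

def project_linf_ball(z, lam):
    """Orthogonal projection onto C = {s : max_k |s_k| <= lam}."""
    return [max(-lam, min(lam, zi)) for zi in z]

def soft_threshold(z, lam):
    """Closed-form proximal operator of lam * ||.||_1."""
    return [math.copysign(max(abs(zi) - lam, 0.0), zi) for zi in z]

def primal_value(w, z, lam):
    """0.5 * ||w - z||^2 + lam * ||w||_1."""
    return 0.5 * sum((wi - zi) ** 2 for wi, zi in zip(w, z)) + lam * sum(abs(wi) for wi in w)

def dual_value(s, z):
    """0.5 * ||z||^2 - 0.5 * ||s - z||^2."""
    return 0.5 * sum(zi ** 2 for zi in z) - 0.5 * sum((si - zi) ** 2 for si, zi in zip(s, z))

z = [3.0, -0.4, 1.2, -2.5]
lam = 1.0
s_star = project_linf_ball(z, lam)                 # dual solution: projection of z on C
w_star = [zi - si for zi, si in zip(z, s_star)]    # primal solution recovered from the dual

# Random perturbations of w_star should never improve the primal objective.
random.seed(0)
perturbed = [[wi + random.uniform(-0.5, 0.5) for wi in w_star] for _ in range(200)]
never_better = all(primal_value(w, z, lam) >= primal_value(w_star, z, lam) - 1e-12
                   for w in perturbed)
```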



Polar sets. Given a subset $C$ of $\mathbb{R}^p$, the polar set of $C$ is denoted $C^\circ$
and defined as:

$$C^\circ = \{ s \in \mathbb{R}^p, \ \forall w \in C, \ w^\top s \leq 1 \}.$$

For any $C$, the polar set $C^\circ$ is a closed convex set that contains zero
in its interior. If $C$ itself satisfies these properties, then $C^{\circ\circ} = C$ (more
generally, $C^{\circ\circ}$ is the closure of the convex hull of $C \cup \{0\}$). Thus, the
polar operation is a bijection between closed convex sets that contain
zero in their interior.

Given a set $C$, the support function $f$ of $C$ (i.e., the Fenchel
conjugate of $I_C$) is such that $C^\circ$ is the set $\{w \in \mathbb{R}^p, \ f(w) \leq 1\}$. In the context
of norms, i.e., when $f$ is a norm, then $C$ is the unit ball of the dual
norm (the dual norm of $f$ is equal to the Fenchel conjugate of the indicator
function of its unit ball, to be distinguished from the Fenchel conjugate
of $f$); the two unit balls are then polar to each other.
A.2 Convex optimization

In this section, we consider several iterative optimization algorithms
dedicated to minimizing a convex function $f$ defined on $\mathbb{R}^p$ (potentially
with infinite values). See also Section 5.1 for a quick review of proximal
methods.



Subgradient descent. A subgradient of a convex function $f$ at $x \in \mathbb{R}^p$
is any vector $g$ such that for all $y \in \mathbb{R}^p$, $f(y) \geq f(x) + g^\top (y - x)$. If
we assume that $f$ is Lipschitz-continuous (with Lipschitz constant $B$)
on the $\ell_2$-ball of radius $D$ (which is assumed to be included in the interior
of the domain of $f$), then the subgradient descent algorithm consists
of (a) starting from any $x_0$ such that $\|x_0\|_2 \leq D$ and (b) iterating the
recursion

$$x_t = \Pi_D(x_{t-1} - \gamma_t g_{t-1}),$$

where $g_{t-1}$ is any subgradient of $f$ at $x_{t-1}$ (with our assumption, such
$g_t$ always exists), and $\Pi_D$ the orthogonal projection on the $\ell_2$-ball of
center zero and radius $D$.
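The recursion can be sketched in a few lines. The step size $\gamma_t = D / (B\sqrt{t})$ used below is a common choice assumed for this illustration, and the objective (an $\ell_1$ distance to a hypothetical point `c` inside the ball) is only a toy example.

```python
import math

def project_ball(x, D):
    """Orthogonal projection onto the l2-ball of center zero and radius D."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return list(x) if norm <= D else [xi * D / norm for xi in x]

def subgradient_descent(f, subgrad, x0, D, B, T):
    """Projected subgradient descent; returns the last iterate and the best values so far."""
    x = list(x0)
    best = f(x)
    history = [best]
    for t in range(1, T + 1):
        g = subgrad(x)
        step = D / (B * math.sqrt(t))          # assumed step size
        x = project_ball([xi - step * gi for xi, gi in zip(x, g)], D)
        best = min(best, f(x))
        history.append(best)
    return x, history

c = [0.3, -0.2]   # hypothetical minimizer, inside the unit ball

def f(x):
    """f(x) = ||x - c||_1, minimized at c with value 0."""
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

def subgrad(x):
    """A subgradient of f: the sign vector of x - c."""
    return [float((xi > ci) - (xi < ci)) for xi, ci in zip(x, c)]

D = 1.0
B = math.sqrt(2)  # Lipschitz constant of f with respect to the l2-norm in dimension 2
x_T, history = subgradient_descent(f, subgrad, [0.0, 0.0], D, B, 2000)
```

The list `history` holds the best objective value seen so far, which is the quantity that the convergence rates bound.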

If we denote /* = min||^.||2^£) f{x), then with jt = "g^i have for 
all t > 0, the convergence rates 

0^ min f(xu) - f* i^^^. 

The following proposition shows that we may also get a certificate 
of optimality with similar guarantee. 

Proposition A.l. Let / be a convex function / defined on M". We 
assume that / is Lipschitz-continuous on K (with diameter D), with 
constant B. Let xt be the t-th iterate of subgradient descent with con- 
stants 7i = and yt = j EL=o5«- Then 

^ min f{xu) -r^r + r{yt) + max -yjx ^ 

u6{0,...,t} xeK y/t 



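Proof. Let $f^*$ be the Fenchel conjugate of $f$, defined as
$f^*(y) = \max_{x \in \mathbb{R}^n} x^\top y - f(x)$. We denote by $g$ the support function of $K$, i.e.,
$g(y) = \max_{x \in K} x^\top y$. We then have

$$\min_{x \in K} f(x) = \min_{x \in K} \ \max_{y \in \mathbb{R}^n} x^\top y - f^*(y) = \max_{y \in \mathbb{R}^n} -f^*(y) - g(-y).$$

We consider the non-negative real number $\mathrm{gap}(x, y) = f(x) + f^*(y) + g(-y)$.
We consider the following subgradient descent iteration

$$x_t = \Pi_K(x_{t-1} - \gamma_t y_{t-1}) \ \text{ with } \ y_{t-1} \in \partial f(x_{t-1}),$$

where $\Pi_K$ is the orthogonal projection on $K$ and $\partial f(x_{t-1})$ is the
subdifferential of $f$ at $x_{t-1}$.

Following standard arguments, we get for any $x \in K$ (using the
contractivity of orthogonal projections):

$$\|x_t - x\|_2^2 \leq \|x_{t-1} - x\|_2^2 + \gamma_t^2 \|y_{t-1}\|_2^2 - 2 \gamma_t (x_{t-1} - x)^\top y_{t-1},$$

leading to

$$(x_{t-1} - x)^\top y_{t-1} \leq \frac{\|x_{t-1} - x\|_2^2 - \|x_t - x\|_2^2 + \gamma_t^2 B^2}{2 \gamma_t}.$$

Thus, summing from $t = 1$ to $T$, we obtain (by summing by parts):

$$\begin{aligned}
\sum_{t=1}^T (x_{t-1} - x)^\top y_{t-1}
&\leq \frac{B^2}{2} \sum_{t=1}^T \gamma_t + \sum_{t=1}^T \frac{1}{2\gamma_t} \big[ \|x_{t-1} - x\|_2^2 - \|x_t - x\|_2^2 \big] \\
&= \frac{B^2}{2} \sum_{t=1}^T \gamma_t + \sum_{t=1}^{T-1} \|x_t - x\|_2^2 \Big( \frac{1}{2\gamma_{t+1}} - \frac{1}{2\gamma_t} \Big) + \frac{1}{2\gamma_1} \|x_0 - x\|_2^2 - \frac{1}{2\gamma_T} \|x_T - x\|_2^2 .
\end{aligned}$$

If we further assume that $\gamma_t$ is non-increasing and that $D = \mathrm{diam}(K)$,
we get

$$\sum_{t=1}^T (x_{t-1} - x)^\top y_{t-1} \leq \frac{B^2}{2} \sum_{t=1}^T \gamma_t + \frac{D^2}{2 \gamma_T}.$$

This leads to, using $f(x) \geq f(x_{t-1}) + y_{t-1}^\top (x - x_{t-1})$,

$$\frac{1}{T} \sum_{t=1}^T \big[ f(x_{t-1}) - f(x) \big] \leq \frac{B^2}{2T} \sum_{t=1}^T \gamma_t + \frac{D^2}{2 T \gamma_T}.$$

We may now apply this to $x^*$, any minimizer of $f$ on $K$, to get the
usual bound

$$\min_{t \in \{1, \dots, T\}} f(x_{t-1}) - f(x^*) \leq \frac{B^2}{2T} \sum_{t=1}^T \gamma_t + \frac{D^2}{2 T \gamma_T}.$$

We now denote $\bar{x}_T = \frac{1}{T} \sum_{u=0}^{T-1} x_u$ and $\bar{y}_T = \frac{1}{T} \sum_{u=0}^{T-1} y_u$. We have

$$\begin{aligned}
f^*(\bar{y}_T) + g(-\bar{y}_T)
&\leq \frac{1}{T} \sum_{t=1}^T f^*(y_{t-1}) + g(-\bar{y}_T) \quad \text{by convexity of } f^* \\
&= \frac{1}{T} \sum_{t=1}^T \big[ -f(x_{t-1}) + x_{t-1}^\top y_{t-1} \big] + g(-\bar{y}_T) \quad \text{because } x_{t-1}, y_{t-1} \text{ are Fenchel-dual} \\
&= -\frac{1}{T} \sum_{t=1}^T f(x_{t-1}) + \frac{1}{T} \sum_{t=1}^T x_{t-1}^\top y_{t-1} - x^\top \bar{y}_T \quad \text{for a certain } x \in K \\
&= -\frac{1}{T} \sum_{t=1}^T f(x_{t-1}) + \frac{1}{T} \sum_{t=1}^T (x_{t-1} - x)^\top y_{t-1} \\
&\leq -f(\bar{x}_T) + \frac{B^2}{2T} \sum_{t=1}^T \gamma_t + \frac{D^2}{2 T \gamma_T},
\end{aligned}$$

where the last inequality uses the convexity of $f$ and the bound above.
This leads to

$$\mathrm{gap}(\bar{x}_T, \bar{y}_T) = f(\bar{x}_T) + f^*(\bar{y}_T) + g(-\bar{y}_T) \leq \frac{B^2}{2T} \sum_{t=1}^T \gamma_t + \frac{D^2}{2 T \gamma_T}.$$

With $\gamma_t = \frac{\alpha D}{B \sqrt{t}}$, and using $\sum_{t=1}^T \frac{1}{\sqrt{t}} \leq 2\sqrt{T}$, we obtain an upper bound

$$\frac{B^2}{2T} \sum_{t=1}^T \gamma_t + \frac{D^2}{2 T \gamma_T}
\leq \frac{\alpha D B}{\sqrt{T}} + \frac{D B}{2 \alpha \sqrt{T}}
= \frac{D B}{\sqrt{T}} \Big( \alpha + \frac{1}{2\alpha} \Big)
= \frac{D B \sqrt{2}}{\sqrt{T}} \ \text{ with } \ \alpha = \frac{1}{\sqrt{2}}.$$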



If we further assume that / is strongly convex with constant [i^ (i.e., 
X I— >• /(x) — ^||x||2 is convex), then by taking 7t = we have, for ah 
t > [HZ], 

0< ,„i„ /(.„)- /-(^ilieil. 

ue{o,...,t} 2^t t 

Conditional gradient descent. We now assume that the function
$f$ is differentiable on a compact convex set $K \subset \mathbb{R}^p$ (with diameter $D$),
and that its gradient is Lipschitz-continuous with constant $L$. We
consider the following conditional gradient algorithm, which is applicable
when linear functions may be maximized efficiently over $K$.

(1) Initialization: Choose any $x_0 \in K$, compute a minimizer
$x_1 \in K$ of $f'(x_0)^\top x$.

(2) Iteration: iterate until the upper bound $\varepsilon$ on the duality gap is
reached:

(a) Compute $\bar{x}_{t-1} \in \arg\min_{x \in K} f'(x_{t-1})^\top x$,

(b) Compute the upper bound on the gap: $(x_{t-1} - \bar{x}_{t-1})^\top f'(x_{t-1})$,

(c) Compute $\omega_{t-1} = \min \Big\{ 1, \ \frac{(x_{t-1} - \bar{x}_{t-1})^\top f'(x_{t-1})}{L \|x_{t-1} - \bar{x}_{t-1}\|_2^2} \Big\}$,

(d) Take $x_t = x_{t-1} + \omega_{t-1} (\bar{x}_{t-1} - x_{t-1})$.

Step (2)(a) corresponds to minimizing the first-order Taylor expansion
at $x_{t-1}$, while step (2)(c) corresponds to performing an approximate
line search on the segment $[x_{t-1}, \bar{x}_{t-1}]$. Combining the analysis of [40] and
[72], we have the following proposition:
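Steps (2)(a) to (2)(d) can be sketched as follows. The example below is illustrative: $K$ is taken to be the probability simplex (so the linear oracle returns a vertex), $f(x) = \frac{1}{2}\|x - c\|_2^2$ with $L = 1$, and the point `c`, the tolerance and the function names are assumptions. The loop starts directly from a vertex rather than running the separate initialization step.

```python
def conditional_gradient(grad, linear_oracle, x0, L, eps, max_iter=100000):
    """Conditional gradient with the step size of step (2)(c); stops when gap <= eps."""
    x = list(x0)
    gap = float("inf")
    for _ in range(max_iter):
        g = grad(x)
        xbar = linear_oracle(g)                           # step (a)
        d = [xb - xi for xb, xi in zip(xbar, x)]
        gap = -sum(gi * di for gi, di in zip(g, d))       # step (b): (x - xbar)^T f'(x)
        if gap <= eps:
            break
        dd = sum(di * di for di in d)                     # positive since gap > eps > 0
        omega = min(1.0, gap / (L * dd))                  # step (c)
        x = [xi + omega * di for xi, di in zip(x, d)]     # step (d)
    return x, gap

c = [0.2, 0.5, 0.3]  # hypothetical target, inside the simplex

def f(x):
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def grad(x):
    return [xi - ci for xi, ci in zip(x, c)]

def simplex_oracle(g):
    """Minimizer of g^T x over the probability simplex: the vertex with the smallest g_k."""
    k = min(range(len(g)), key=lambda i: g[i])
    return [1.0 if i == k else 0.0 for i in range(len(g))]

eps = 1e-3
x, gap = conditional_gradient(grad, simplex_oracle, [1.0, 0.0, 0.0], L=1.0, eps=eps)
```

Since the minimum of $f$ over the simplex is zero here, the returned `gap` directly upper-bounds `f(x)`, which illustrates its role as a certificate of optimality.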



Proposition A. 2. For the previous algorithm with have: f{xt) — 
minxfzK f{x) ^ Moreover, there exists at least one k G [2t/3,t] 

such that max^.gi<-(xfc — x)~^ f'{xk) ^ "^^^ , i.e., the primal-dual pair 
is a certificate of optimality ensuring at least an approxi- 
mate optimality of ^^J? . 



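Proof. Let $g(z) = \max_{x \in K} (z - x)^\top f'(z)$. It is a certificate of duality for
$z \in K$. We denote $\Delta_t = f(x_t) - \min_{x \in K} f(x)$. We have $\Delta_t \leq g(x_t)$.
Moreover, following [40], we have $\Delta_1 \leq \frac{L D^2}{2}$ and

$$\begin{aligned}
\Delta_t &\leq \Delta_{t-1} + f'(x_{t-1})^\top (x_t - x_{t-1}) + \frac{L}{2} \|x_t - x_{t-1}\|_2^2 \\
&= \Delta_{t-1} + \omega_{t-1} f'(x_{t-1})^\top (\bar{x}_{t-1} - x_{t-1}) + \frac{L \omega_{t-1}^2}{2} \|\bar{x}_{t-1} - x_{t-1}\|_2^2 \\
&= \Delta_{t-1} - \omega_{t-1} g(x_{t-1}) + \frac{L \omega_{t-1}^2}{2} \|\bar{x}_{t-1} - x_{t-1}\|_2^2 \\
&\leq \Delta_{t-1} - \frac{1}{2} \min \Big\{ \frac{g(x_{t-1})^2}{L D^2}, \ g(x_{t-1}) \Big\}.
\end{aligned}$$

This implies that $\Delta_t$ is non-increasing, and thus $\Delta_t \leq \Delta_1 \leq \frac{L D^2}{2}$. This
implies, using $\Delta_{t-1} \leq g(x_{t-1})$:

$$\Delta_t \leq \Delta_{t-1} - \frac{1}{2 L D^2} \Delta_{t-1}^2 .$$

By dividing by $\Delta_t \Delta_{t-1}$, we get:

$$\frac{1}{\Delta_{t-1}} \leq \frac{1}{\Delta_t} - \frac{1}{2 L D^2},$$

and thus $\frac{2}{L D^2} \leq \frac{1}{\Delta_1} \leq \frac{1}{\Delta_t} - \frac{t-1}{2 L D^2}$, which implies for any $t \geq 1$,

$$\Delta_t \leq \frac{2 L D^2}{t + 3} \leq \frac{2 L D^2}{t}.$$

Let us now assume that for all $u \in \{\alpha t, \dots, t\}$, $g(x_u) \geq \frac{\beta L D^2}{\alpha t + 3}$.
We then have

$$\Delta_t \leq \Delta_{\alpha t} - \frac{1}{2} \sum_{u = \alpha t}^{t-1} \min \Big\{ \frac{g(x_u)^2}{L D^2}, \ g(x_u) \Big\}
\leq \frac{2 L D^2}{\alpha t + 3} - \frac{\beta^2 L D^2 \, t (1 - \alpha)}{2 (\alpha t + 3)^2}.$$

With $\alpha = 2/3$ and $\beta = 3$, we obtain that $\Delta_t < 0$, which is a
contradiction. This leads to the desired result. □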



B 



Miscellaneous results on submodular functions 



B.1 Conjugate functions

The next proposition computes the Fenchel conjugate of the Lovász
extension restricted to $[0,1]^p$, noting that by Prop. 4.1 the regular
Fenchel conjugate of the unrestricted Lovász extension is the indicator
function of the base polyhedron (for a definition of Fenchel conjugates,
see [17, 16] and Appendix A). This allows a form of conjugacy between
set-functions and convex functions (see more details in [49]).

Proposition B.1. (Conjugate of a submodular function) Let $F$
be a submodular function such that $F(\emptyset) = 0$. The conjugate
$\tilde{f} : \mathbb{R}^p \to \mathbb{R}$ of $F$ is defined as $\tilde{f}(s) = \max_{A \subset V} s(A) - F(A)$. Then, the
conjugate function $\tilde{f}$ is convex, and is equal to the Fenchel conjugate
of the Lovász extension restricted to $[0,1]^p$. Moreover, for all $A \subset V$,
$F(A) = \max_{s \in \mathbb{R}^p} s(A) - \tilde{f}(s)$.

Proof. The function $\tilde{f}$ is a maximum of linear functions and thus it is
convex. We have for $s \in \mathbb{R}^p$:

$$\max_{w \in [0,1]^p} w^\top s - f(w) = \max_{A \subset V} s(A) - F(A) = \tilde{f}(s),$$

because $F - s$ is submodular and because of Prop. 2.4, which leads
first to the desired result. The last assertion is a direct consequence of the
fact that $F(A) = f(1_A)$. □
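The two statements of the proposition can be checked numerically on a small example. In the sketch below, the submodular function `F` (a square root of a modular function with hypothetical weights `c`) and all helper names are assumptions; the Lovász extension is evaluated with the greedy formula, and the maximum over $[0,1]^p$ is only probed by random sampling and at the vertices.

```python
import math
import random
from itertools import combinations

def subsets(p):
    """All subsets of {0, ..., p-1} as frozensets."""
    for r in range(p + 1):
        for A in combinations(range(p), r):
            yield frozenset(A)

def lovasz_extension(F, w):
    """Greedy formula: sort w in decreasing order and weight the marginal gains of F."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, prefix, prev = 0.0, [], 0.0
    for k in order:
        prefix.append(k)
        cur = F(frozenset(prefix))
        value += w[k] * (cur - prev)
        prev = cur
    return value

def set_conjugate(F, s):
    """Conjugate of the set-function: max over A of s(A) - F(A)."""
    return max(sum(s[k] for k in A) - F(A) for A in subsets(len(s)))

def tight_base(F, p, A):
    """Greedy base of F with the elements of A ordered first, so that s(A) = F(A)."""
    order = sorted(A) + [k for k in range(p) if k not in A]
    s, prefix = [0.0] * p, frozenset()
    for k in order:
        s[k] = F(prefix | {k}) - F(prefix)
        prefix = prefix | {k}
    return s

c = [1.0, 2.0, 0.5, 3.0]  # hypothetical positive weights
p = len(c)

def F(A):
    return math.sqrt(sum(c[k] for k in A))

random.seed(0)
max_violation = -float("inf")
for _ in range(200):
    s = [random.uniform(-2, 2) for _ in range(p)]
    w = [random.random() for _ in range(p)]
    lhs = sum(wi * si for wi, si in zip(w, s)) - lovasz_extension(F, w)
    max_violation = max(max_violation, lhs - set_conjugate(F, s))
```

A non-positive `max_violation` means that no sampled $w \in [0,1]^p$ exceeded the maximum over sets, as the proposition asserts.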

B.2 Operations that preserve submodularity 

In this section, we present several ways of building submodular func- 
tions from existing ones. For all of these, we describe how the Lovasz 
extensions and the submodular polyhedra are affected. Note that in 
many cases, operations are simpler in terms of submodular and base 
polyhedra. Many operations such as projections onto subspaces may be 
interpreted in terms of polyhedra corresponding to other submodular 
functions. 

We have seen in Section 3.5 that given any submodular function
$F$, we may define $G(A) = F(A) + F(V \setminus A) - F(V)$. Then $G$ is always
submodular and symmetric (and thus non-negative, see Section 7.4).
This symmetrization can be applied to any submodular function and, in
the examples of Section 3, it often leads to interesting new functions.
We now present other operations that preserve submodularity.

Proposition B.2. (Restriction of a submodular function) Let
$F$ be a submodular function such that $F(\emptyset) = 0$ and $A \subset V$. The
restriction of $F$ on $A$, denoted $F_A$, is a set-function on $A$ defined as
$F_A(B) = F(B)$ for $B \subset A$. The function $F_A$ is submodular. Moreover,
if we can write the Lovász extension of $F$ as $f(w) = f(w_A, w_{V \setminus A})$,
then the Lovász extension of $F_A$ is $f_A(w_A) = f(w_A, 0)$. Moreover, the
submodular polyhedron $P(F_A)$ is simply the projection of $P(F)$ on the
components indexed by $A$, i.e., $s \in P(F_A)$ if and only if $\exists t$ such that
$(s, t) \in P(F)$.

Proof. Submodularity and the form of the Lovász extension are
straightforward from definitions. To obtain the submodular
polyhedron, notice that we have $f_A(w_A) = f(w_A, 0) = \max_{(s,t) \in P(F)} w_A^\top s + 0^\top t$,
which implies the desired result, as it shows that the Fenchel
conjugate of this Lovász extension is the indicator function of a
polyhedron. □
Proposition B.3. (Contraction of a submodular function) Let
$F$ be a submodular function such that $F(\emptyset) = 0$ and $A \subset V$. The
contraction of $F$ on $A$, denoted $F^A$, is a set-function on $V \setminus A$ defined
as $F^A(B) = F(A \cup B) - F(A)$ for $B \subset V \setminus A$. The function $F^A$ is
submodular. Moreover, if we can write the Lovász extension of $F$ as
$f(w) = f(w_A, w_{V \setminus A})$, then the Lovász extension of $F^A$ is $f^A(w_{V \setminus A}) =
f(1_A, w_{V \setminus A}) - F(A)$. Moreover, the submodular polyhedron $P(F^A)$ is
simply the projection of $P(F) \cap \{s(A) = F(A)\}$ on the components
indexed by $V \setminus A$, i.e., $t \in P(F^A)$ if and only if $\exists s \in P(F) \cap \{s(A) =
F(A)\}$, such that $s_{V \setminus A} = t$.

Proof. Submodularity and the form of the Lovász extension are
straightforward from definitions. Let $t \in \mathbb{R}^{|V \setminus A|}$. If $\exists s \in P(F) \cap \{s(A) =
F(A)\}$, such that $s_{V \setminus A} = t$, then we have for all $B \subset V \setminus A$, $t(B) =
t(B) + s(A) - F(A) \leq F(A \cup B) - F(A)$, and hence $t \in P(F^A)$. If
$t \in P(F^A)$, then take any $v \in B(F_A)$ and concatenate $v$ and $t$ into
$s$. Then, for all subsets $C \subset V$, $s(C) = s(C \cap A) + s(C \cap (V \setminus A)) =
v(C \cap A) + t(C \cap (V \setminus A)) \leq F(C \cap A) + F(A \cup (C \cap (V \setminus A))) - F(A) =
F(C \cap A) + F(A \cup C) - F(A) \leq F(C)$ by submodularity. Hence
$s \in P(F) \cap \{s(A) = F(A)\}$, since $s(A) = v(A) = F(A)$.

□
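Restriction and contraction are easy to experiment with by brute force. The sketch below uses a hypothetical submodular function `F` and illustrative helper names; it also builds a vector $s \in P(F)$ with $s(A) = F(A)$ by the greedy algorithm, whose components outside $A$ should then lie in $P(F^A)$.

```python
import math
from itertools import combinations

def subsets(ground):
    """All subsets of the given ground set as frozensets."""
    ground = list(ground)
    for r in range(len(ground) + 1):
        for A in combinations(ground, r):
            yield frozenset(A)

def is_submodular(F, ground, tol=1e-9):
    """Check F(A) + F(B) >= F(A | B) + F(A & B) for all pairs of subsets."""
    S = list(subsets(ground))
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - tol for A in S for B in S)

def restriction(F, A):
    """F_A(B) = F(B), to be evaluated only on subsets B of A."""
    return lambda B: F(frozenset(B))

def contraction(F, A):
    """F^A(B) = F(A | B) - F(A), to be evaluated on subsets B of V minus A."""
    A = frozenset(A)
    return lambda B: F(A | frozenset(B)) - F(A)

c = [1.0, 2.0, 0.5, 3.0, 1.5]  # hypothetical weights
V = frozenset(range(5))

def F(S):
    """Concave function of a modular function plus a modular term (submodular)."""
    return math.sqrt(sum(c[k] for k in S)) - 0.3 * len(S)

A = frozenset({0, 1})
F_res = restriction(F, A)
F_con = contraction(F, A)

# Greedy base of F with the elements of A first: s is in P(F) and s(A) = F(A).
s, prefix = {}, frozenset()
for k in sorted(A) + sorted(V - A):
    s[k] = F(prefix | {k}) - F(prefix)
    prefix = prefix | {k}
```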

The next proposition shows how to build a new submodular function
from an existing one, by partial minimization. Note the similarity
(and the difference) between the submodular polyhedra for a partial
minimum (Prop. B.4) and for the restriction defined in Prop. B.2.

Note also that contrary to convex functions, the pointwise maximum
of two submodular functions is not in general submodular (as can be
seen by considering functions of the cardinality from Section 3.1).



Proposition B.4. (Partial minimum of a submodular function)
We consider a submodular function $G$ on $V \cup W$, where $V \cap W = \emptyset$
(and $|W| = q$), with Lovász extension $g : \mathbb{R}^{p+q} \to \mathbb{R}$. We consider, for
$A \subset V$, $F(A) = \min_{B \subset W} G(A \cup B) - \min_{B \subset W} G(B)$. The set-function
$F$ is submodular and such that $F(\emptyset) = 0$. Its Lovász extension is such
that for all $w \in [0,1]^p$, $f(w) = \min_{v \in [0,1]^q} g(w, v) - \min_{v \in [0,1]^q} g(0, v)$.
Moreover, if $\min_{B \subset W} G(B) = 0$, we have for all $w \in \mathbb{R}_+^p$, $f(w) =
\min_{v \in \mathbb{R}_+^q} g(w, v)$, and the submodular polyhedron $P(F)$ is the set of
$s \in \mathbb{R}^p$ such that there exists $t \in \mathbb{R}_+^q$, such that $(s, t) \in P(G)$.



Proof. Define c = miuBcW G{B), which is independent of A. We have, 
for A, A' C V, and any B, B' C W, by definition of F: 

F{AuA')+F{AnA') 
^ -2c + G{[A U A'] U[BU B']) + G{[A n A'] U [B n B']) 
= -2c + G{[A UB]U [A' U B']) + G{[A UB]n [A' U B']) 
^ -2c + G{A LIB) + G{A' U B') by submodularity. 

Minimizing with respect to B and B' leads to the submodularity of F. 

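Following Prop. B.1, we can get the conjugate function $\tilde f$ of $F$ from the
conjugate function $\tilde g$ of $G$. For $s \in \mathbb{R}^p$, we have, by definition,
$\tilde f(s) = \max_{A \subseteq V} s(A) - F(A) = \max_{A \cup B \subseteq V \cup W} s(A) + c - G(A \cup B) = c + \tilde g(s, 0)$.
We thus get from Prop. B.1 that for $w \in [0,1]^p$,

\[
\begin{aligned}
f(w) &= \max_{s \in \mathbb{R}^p} \; w^\top s - \tilde f(s) \\
&= \max_{s \in \mathbb{R}^p} \; w^\top s - \tilde g(s, 0) - c \\
&= \max_{s \in \mathbb{R}^p} \; \min_{(\tilde w, v) \in [0,1]^{p+q}} \; w^\top s - \tilde w^\top s + g(\tilde w, v) - c \quad \text{by applying Prop. B.1} \\
&= \min_{(\tilde w, v) \in [0,1]^{p+q}} \; \max_{s \in \mathbb{R}^p} \; w^\top s - \tilde w^\top s + g(\tilde w, v) - c \\
&= \min_{v \in [0,1]^q} g(w, v) - c \quad \text{by maximizing with respect to } s.
\end{aligned}
\]

Note that $c = \min_{B \subseteq W} G(B) = \min_{v \in [0,1]^q} g(0, v)$.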

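For any $w \in \mathbb{R}^p_+$, for any $\lambda \geq \|w\|_\infty$, we have $w/\lambda \in [0,1]^p$, and
thus

\[
\begin{aligned}
f(w) = \lambda f(w/\lambda) &= \min_{v \in [0,1]^q} \lambda g(w/\lambda, v) - c\lambda = \min_{v \in [0,1]^q} g(w, \lambda v) - c\lambda \\
&= \min_{v \in [0,\lambda]^q} g(w, v) - c\lambda.
\end{aligned}
\]

Thus, if $c = 0$, we have $f(w) = \min_{v \in \mathbb{R}^q_+} g(w, v)$, by letting $\lambda \to +\infty$.
We then also have:

\[
\begin{aligned}
f(w) = \min_{v \in \mathbb{R}^q_+} g(w, v) &= \min_{v \in \mathbb{R}^q_+} \; \max_{(s,t) \in P(G)} \; w^\top s + v^\top t \\
&= \max_{(s,t) \in P(G),\ t \in \mathbb{R}^q_+} w^\top s.
\end{aligned}
\]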

□ 
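
The set-function part of Prop. B.4 can be checked by brute force on a small instance. The sketch below is an illustration under our own assumptions: the sets `V` and `W`, the graph weights `edges`, the modular term `offset`, and the helper names are hypothetical and not taken from the text. It takes $G$ to be a cut function plus a modular function (which is submodular) and verifies that the partial minimum $F$ is submodular with $F(\varnothing) = 0$.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(F, ground):
    """Brute-force test of F(A) + F(B) >= F(A | B) + F(A & B) for all A, B."""
    all_sets = list(subsets(ground))
    return all(F(A) + F(B) >= F(A | B) + F(A & B)
               for A in all_sets for B in all_sets)

V = frozenset({0, 1, 2})   # variables that are kept
W = frozenset({3, 4})      # variables that are minimized out

# Hypothetical data: non-negative edge weights and a modular offset.
edges = {(0, 1): 1, (0, 3): 2, (1, 3): 1, (1, 4): 3, (2, 4): 2, (3, 4): 1}
offset = {0: 0, 1: 0, 2: 0, 3: -5, 4: 1}

def G(C):
    """Cut function plus modular term on V | W (submodular)."""
    cut = sum(w for (i, j), w in edges.items() if (i in C) != (j in C))
    return cut + sum(offset[k] for k in C)

# c = min_{B subset of W} G(B), independent of A.
c = min(G(B) for B in subsets(W))

def F(A):
    """Partial minimum: F(A) = min_B G(A | B) - min_B G(B)."""
    return min(G(A | B) for B in subsets(W)) - c

print(is_submodular(G, V | W), is_submodular(F, V), F(frozenset()) == 0)
```

With integer weights the comparisons are exact, so no numerical tolerance is needed.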



The following propositions give an interpretation of the intersection
between the submodular polyhedron and sets of the form $\{s \leq z\}$
and $\{s \geq z\}$. Prop. B.5 notably implies that for all $z \in \mathbb{R}^p$, we have:
$\min_{B \subseteq V} F(B) + z(V \setminus B) = \max_{s \in P(F),\ s \leq z} s(V)$, which implies the
second statement of Prop. 7.3 for $z = 0$.

Proposition B.5. (Convolution of a submodular function and
a modular function) Let $F$ be a submodular function such that
$F(\varnothing) = 0$ and $z \in \mathbb{R}^p$. Define $G(A) = \min_{B \subseteq A} F(B) + z(A \setminus B)$. Then
$G$ is submodular, satisfies $G(\varnothing) = 0$, and the submodular polyhedron
$P(G)$ is equal to $P(F) \cap \{s \leq z\}$. Moreover, for all $A \subseteq V$, $G(A) \leq F(A)$
and $G(A) \leq z(A)$.



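Proof. Let $A, A' \subseteq V$, and $B, B'$ the corresponding minimizers defining
$G(A)$ and $G(A')$. We have:

\[
\begin{aligned}
G(A) + G(A')
&= F(B) + z(A \setminus B) + F(B') + z(A' \setminus B') \\
&\geq F(B \cup B') + F(B \cap B') + z(A \setminus B) + z(A' \setminus B') \quad \text{by submodularity,} \\
&= F(B \cup B') + F(B \cap B') + z([A \cup A'] \setminus [B \cup B']) + z([A \cap A'] \setminus [B \cap B']) \\
&\geq G(A \cup A') + G(A \cap A') \quad \text{by definition of } G,
\end{aligned}
\]

hence the submodularity of $G$. If $s \in P(G)$, then $\forall B \subseteq A \subseteq V$, $s(A) \leq
G(A) \leq F(B) + z(A \setminus B)$. Taking $B = A$, we get that $s \in P(F)$; from
$B = \varnothing$, we get $s \leq z$, and hence $s \in P(F) \cap \{s \leq z\}$. If $s \in P(F) \cap \{s \leq
z\}$, for all $B \subseteq A \subseteq V$, $s(A) = s(A \setminus B) + s(B) \leq z(A \setminus B) + F(B)$; by
minimizing with respect to $B$, we get that $s \in P(G)$.

We get $G(A) \leq F(A)$ by taking $B = A$ in the definition of $G(A)$,
and we get $G(A) \leq z(A)$ by taking $B = \varnothing$. □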
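
As a sanity check of Prop. B.5, the following sketch evaluates the convolution by brute force on a three-element base set and compares the two polyhedra on a grid of integer points. Everything specific in it is an assumption made for illustration: the function `F`, the vector `z`, the grid, and the helper names are hypothetical and not taken from the text.

```python
from itertools import combinations, product

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(F, ground):
    """Brute-force test of F(A) + F(B) >= F(A | B) + F(A & B) for all A, B."""
    all_sets = list(subsets(ground))
    return all(F(A) + F(B) >= F(A | B) + F(A & B)
               for A in all_sets for B in all_sets)

V = frozenset({0, 1, 2})
z = {0: 1, 1: 5, 2: 2}                 # hypothetical modular weights

def F(A):
    return len(A) * (4 - len(A))       # concave in |A|: 0, 3, 4, 3

def z_of(A):
    return sum(z[k] for k in A)

def G(A):
    """Convolution: G(A) = min over B subset of A of F(B) + z(A \\ B)."""
    return min(F(B) + z_of(A - B) for B in subsets(A))

def in_polyhedron(s, H):
    """Membership of s in P(H): s(A) <= H(A) for every subset A."""
    return all(sum(s[k] for k in A) <= H(A) for A in subsets(V))

all_sets = list(subsets(V))
bounds_ok = all(G(A) <= F(A) and G(A) <= z_of(A) for A in all_sets)

# Compare P(G) with P(F) intersected with {s <= z} on integer grid points.
same_polyhedron = all(
    in_polyhedron(s, G) == (in_polyhedron(s, F) and all(s[k] <= z[k] for k in V))
    for s in product(range(-2, 7), repeat=len(V))
)

print(is_submodular(G, V), G(frozenset()) == 0, bounds_ok, same_polyhedron)
```

A grid comparison is of course not a proof of the equality of the two polyhedra; it only confirms that no grid point separates them on this instance.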

Proposition B.6. (Monotonization of a submodular function)

Let $F$ be a submodular function such that $F(\varnothing) = 0$. Define $G(A) =
\min_{B \supseteq A} F(B) - \min_{B \subseteq V} F(B)$. Then $G$ is submodular such that
$G(\varnothing) = 0$, and the base polyhedron $B(G)$ is equal to $B(F) \cap \{s \geq 0\}$.
Moreover, $G$ is non-decreasing, and for all $A \subseteq V$, $G(A) \leq F(A)$.



Proof. Let $c = \min_{B \subseteq V} F(B)$. Let $A, A' \subseteq V$, and $B, B'$ the
corresponding minimizers defining $G(A)$ and $G(A')$. We have:

\[
\begin{aligned}
G(A) + G(A') &= F(B) + F(B') - 2c \\
&\geq F(B \cup B') + F(B \cap B') - 2c \quad \text{by submodularity} \\
&\geq G(A \cup A') + G(A \cap A') \quad \text{by definition of } G,
\end{aligned}
\]

hence the submodularity of $G$. It is obviously non-decreasing. We get
$G(A) \leq F(A)$ by taking $B = A$ in the definition of $G(A)$. Since $G$ is
increasing, $B(G) \subseteq \mathbb{R}^p_+$ (because all of its extreme points, obtained by
the greedy algorithm, are in $\mathbb{R}^p_+$). By definition of $G$, $B(G) \subseteq B(F)$.
Thus $B(G) \subseteq B(F) \cap \mathbb{R}^p_+$. The opposite inclusion is trivial from the
definition.

□ 
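
The monotonization of Prop. B.6 can also be explored numerically. The sketch below is an illustration under our own assumptions: the path graph `edges`, the function `F`, and all helper names are hypothetical. The example takes a non-negative, non-monotone $F$, so that $\min_{B \subseteq V} F(B) = 0$ is attained at the empty set; it then checks that $G$ is submodular, non-decreasing and below $F$, and that the extreme points of $B(G)$ produced by the greedy algorithm are non-negative and belong to $B(F)$.

```python
from itertools import combinations, permutations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(F, ground):
    """Brute-force test of F(A) + F(B) >= F(A | B) + F(A & B) for all A, B."""
    all_sets = list(subsets(ground))
    return all(F(A) + F(B) >= F(A | B) + F(A & B)
               for A in all_sets for B in all_sets)

V = frozenset({0, 1, 2, 3})
edges = {(0, 1): 3, (1, 2): 1, (2, 3): 2}   # hypothetical path graph

def F(A):
    """Cut function plus cardinality: submodular, >= 0, not monotone."""
    cut = sum(w for (i, j), w in edges.items() if (i in A) != (j in A))
    return cut + len(A)

c = min(F(B) for B in subsets(V))           # equals 0 for this F

def G(A):
    """Monotonization: min of F over supersets of A, minus min of F."""
    return min(F(A | extra) for extra in subsets(V - A)) - c

def greedy_base(order, H):
    """Extreme point of B(H) given by the greedy algorithm for `order`."""
    s, prefix = {}, frozenset()
    for k in order:
        s[k] = H(prefix | {k}) - H(prefix)
        prefix = prefix | {k}
    return s

all_sets = list(subsets(V))
monotone = all(G(A) <= G(B) for A in all_sets for B in all_sets if A <= B)
below_F = all(G(A) <= F(A) for A in all_sets)

def in_base_of_F(s):
    """Check s(A) <= F(A) for all A and s(V) = F(V)."""
    return (all(sum(s[k] for k in A) <= F(A) for A in all_sets)
            and sum(s.values()) == F(V))

bases = [greedy_base(order, G) for order in permutations(sorted(V))]
bases_ok = all(min(s.values()) >= 0 and in_base_of_F(s) for s in bases)

print(is_submodular(G, V), G(frozenset()) == 0, monotone, below_F, bases_ok)
```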